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ABSTRACT 

In  this  paper  we  present  a  unified  framework  for  simulating  Markovian  models  of  highly  depend¬ 
able  systems.  Since  the  failure  event  is  a  rare  event,  the  estimation  of  system  dependability  measures 
using  standard  simulation  requires  very  long  simulation  runs.  We  show  that  a  variance  reduction 
technique  called  Importance  Sampling  can  be  used  to  speed  up  the  simulation  by  many  orders  of 
magnitude  over  standard  simulation.  This  technique  can  be  combined  very  effectively  with  regen¬ 
erative  simulation  to  estimate  measures  such  as  steady-state  availability  and  mean  time  to  failure. 
Moreover,  it  can  be  combined  with  conditional  Monte  Carlo  methods  to  quickly  estimate  transient 
measures  such  as  reliability,  expected  interval  availability  and  the  distribution  of  interval  availability 
We  show  the  effectiveness  of  these  methods  by  using  them  to  simulate  large  dependability  models. 
We  also  discuss  how  these  methods  can  be  implemented  in  a  software  package  to  compute  both 
transient  and  steady-state  measures  simultaneously  from  the  same  sample  run. 
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I.  Introduction 


1  he  requirements  for  highly  dependable  systems,  such  as  air  traffic  control  and  transaction  proc¬ 
essing  systems,  increase  the  importance  of  dependability  prediction  at  a  design  stage.  Typically, 
stochastic  models  are  used  to  analyze  such  systems.  A  system  is  considered  to  be  a  collection  of 
components  which  can  fail  and  possibly  get  repaired.  The  system  is  considered  operational  if  at  any 
given  moment  the  operational  components  satisfy  some  minimum  system  operational  require¬ 
ments.  Many  details  of  failure  and  repair  behavior  of  the  components  have  been  introduced  in  such 
models:  common-mode  failures  in  [2,  5,  30],  detailed  fault  and  error  handling  models  in  [9]  and 
detailed  recovery  hierarchies,  operational  and  repair  dependences  in  [15,  17],  Models  which  include 
degraded  modes  of  operation  have  also  been  introduced  [17,40],  Different  measures  are  used  to 
evaluate  the  modeled  systems  depending  upon  whether  they  are  mission  oriented  systems  or  con¬ 
tinuously  operating  sy  stems.  Some  of  the  dependability  measures  of  interest  arc  steady-state  avail¬ 
ability.  reliability,  mean  time  to  failure,  expected  interval  availability  and  the  complementary 
distribution  or  interval  availability  (i.e..  the  probability  that  a  system  would  achieve  a  higher  interval 
availability  than  a  specified  value  between  0  and  I.)  Similar  measures  have  also  been  constructed 
for  degradable  jystems,  e  g.,  steady-state  performance  and  distribution  of  performance  over  a  time 
interval  [32].  Detailed  surveys  of  these  modeling  techniques  and  the  dependability  measures  calcu¬ 
lated  appear  in  [12,31], 


The  most  common  stochastic  models  used  in  this  context  are  continuous-time  Markov  chains 
(CTMCs).  Typically,  numerical  methods  are  used  to  solve  Markov  chains.  Although,  many 
modeling  packages  have  been  built,  c.g.,  [17]  and  [9],  which  incorporate  numerical  methods  ca¬ 
pable  of  computing  steady-state  as  well  as  transient  state  probabilities  of  Markov  chains  with 
thousands  of  states,  the  size  of  system  modeled  is  typically  small  because  the  number  of  states  in 
the  system  increases  exponentially  with  the  number  of  components.  Techniques  like  state  lumping 
and  unlumping  [16,35]  and  state  aggregation  and  bounding  [1,  34]  can  reduce  the  size  of  the  state 
space  substantially.  However,  large  systems  with  a  large  number  of  redundant  components  arc  still 
out  of  the  range  of  the  solution  capabilities  of  current  numerical  methods,  primarily  due  to  storage 
or  computational  limitations. 

An  alternate  approach  for  the  solution  of  large  models  is  Monte  Carlo  simulation,  which  is  the 
subject  of  this  paper.  Simulation  is  especially  useful  for  those  models  for  which  the  transition  rate 
matrices  exceed  the  available  storage.  By  nature,  this  approach  has  the  immediate  advantage  of 
having  relatively  small  storage  requirements.  On  the  other  hand,  since  the  failure  events  arc  rare 
events,  it  is  apparent  that  the  analysis  by  simulation  of  large  models  with  a  high  degree  of  redun¬ 
dancy  will  require  many  regenerative  cycles  or  many  long  independent  replications  in  order  to  attain 
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reasonable  confidence  intervals  [12,  2() J.  Our  goal  is  to  obtain  variance  reduction  methods  that  are 
applicable  to  a  broad  class  of  models.  Specifically,  we  are  interested  in  models  defined  by  the  reli¬ 
ability  and  availability  modeling  language  described  in  [17],  so  that  the  techniques  can  be  imple¬ 
mented  in  a  software  package  and  made  available  to  designers  in  an  automatic  and  transparent 
fashion.  A  typical  system  contains  multiple  component  types  with  redundant  units  for  each  com¬ 
ponent  type.  Failure  of  these  systems  is  usually  caused  by  exhaustion  of  redundancy  or  by  a 
combination  of  component  failures  leading  to  a  system  failure,  bailed  components  may  be  repaired. 
If  all  components  are  repairable  and  component  failures  are  exponential,  then  a  regenerative  state 
for  the  system  (see,  e  g  .  [X])  is  the  state  where  all  units  of  all  component  types  are  operational. 
If.  in  addition,  all  repair  time'  am  exponentially  distributed,  then  the  system  cdti  be  modeled  by  a 
continuous  time  Markov  chain.  For  highly  reliable  and  highly  available  systems,  it  is  usual  for  the 
repair  recovery  rates  of  components  to  be  orders  of  magnitude  larger  than  the  failure  rates,  and  in 
these  circumstances  the  use  of  importance  sampling  variance  reduction  techniques  [20,  22]  can  be 
very  effective  in  reducing  the  simulation  run  length  significantly 

Importance  sampling  for  rare  event  simulation  has  been  used  successfully  in  [h],  [28],  [42],  [44] 
and  [38],  Proper  selection  of  the  importance  sampling  distribution  makes  lire  rare  events  more 
likely  to  occur;  this  results  in  a  variance  reduction.  The  key,  of  course,  is  to  choose  a  good  im¬ 
portance  sampling  distribution.  The  theory'  of  large  deviations  was  used  in  [6],  [42]  and  [44,  38] 
to  select  an  effective  distribution  for  problems  arising  in  Markov  chains  with  “small  increments", 
random  walks,  and  queueing  networks,  respectively.  Hflcetivc  heuristics  were  used  in  [28]  to  select 
importance  sampling  distributions  for  reliability  estimation  in  large  models  of  machine  repairman 
type. 

In  this  paper,  we  review  and  extend  the  ods  in  [4],  [18]  and  [41]  and  present  a  comprehensive 
and  unified  framework  for  simulating  a  1  ■  d  class  of  models,  specifically  models  defined  by  the 
reliability  and  availability  modeling  language  described  in  [17],  The  language  is  used  to  describe 
failure  and  repair  behavior  of  components  as  well  as  operational,  repair,  recovery  anil  common- 
inode  failure  dependencies  among  them.  The  language  is  also  used  to  describe  conditions  (e  g., 
reliability  block  diagram  or  fault-tree)  under  which  the  system  is  considered  operational.  The  model 
described  by  the  language  is  simulated  assuming  that  all  failure,  repair  and  recovery  distributions 
arc  exponential.  We  estimate  both  steady-state  and  transient  measures  simultaneously  from  the 
simulation.  Importance  sampling  techniques  arc  used  to  estimate  these  measures;  these  techniques 
are  orders  of  magnitude  faster  than  ordinary  simulation.  The  basic  idea  behind  importance  sam¬ 
pling  is  described  in  Section  2.  We  also  give  formal  definitions  of  the  dependability  measures  of 
interest  in  this  section. 


In  Section  '  we  present  our  methods  (or  estimating  dependabilit v  measures,  such  as  the  steady-state 
availability  and  the  mean  time  to  failure  (M  i  l  l  ).  1  he  estimators  are  based  on  combining  regen¬ 
erative  simulation  [S]  with  importance  sampling.  The  concepts  of  dynamic  importance  sampling 
(IMS.  see  [4])  and  measure  specific  dynamic  importance  sampling  (MS1MS,  see  [IN])  are  explained 
using  a  very  simple  three  state  example.  Direct  application  of  these  techniques  docs  not  yield  a 
significant  variance  reduction  for  the  M  i  l  l  However,  when  the  M  i  l  l  is  formulated  as  a  ratio 
of  two  expectations  (both  are  estimated  using  a  regenerative  simulation),  then  significant  variance 
reductions  can  be  achieved  using  our  importance  sampling  techniques  (see  also  [41].)  Therefore, 
while  the  M  i  l  l  is.  in  fact,  a  transient  measure,  we  can  estimate  it  using  a  regenerative  simulation; 
this  is  the  reason  why  we  have  considered  its  estimation  with  other  steady-state  measures  in  Section 
.V  The  equations  for  optimal  run-length  allocation  and  asymptotic  bias  expansions  arc  also  given. 

In  Section  4  we  present  our  methods  for  estimating  the  transient  measures,  such  as  reliability,  in¬ 
terval  availability  and  distribution  of  interval  availability.  The  estimators  arc  obtained  by  inde¬ 
pendently  replicating  observations  based  on  combining  conditioning'  (e.g..  [10.  II])  or  "forcing' 
(e  g  .  [2N]|  methods  with  importance  sampling.  In  Section  5  we  show  how  we  implemented  both 
regenerative  simulation  and  the  independent  replications  so  that  steady-state  and  transient  measures 
can  be  computed  simultaneously  from  the  same  sample  run.  Some  implementation  issues  as  well 
as  theoretical  issues  in  using  importance  sampling  arc  also  described  in  this  section. 

In  Section  6  we  show  the  effectiveness  of  the  above  techniques  in  reducing  the  simulation  run  time 
of  a  large  example.  Typically,  we  obtain  orders  of  magnitude  reduction  in  the  run  time  over 
standard  simulation.  We  also  perform  coverage  experiments  on  the  confidence  intervals  and  com¬ 
pute  the  bias  values  for  the  estimates  of  the  steady-state  availability.  I  anally,  in  Section  7  we  give 
concluding  remarks  and  suggest  some  directions  for  future  research. 
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2.  Background  and  Notation 


In  thi.~  section,  wo  review  the  basic  ideas  of  importance  sampling  We  include  this  background 
material  to  make  the  paper  self  contained  and  more  accessible  to  the  non-specialist.  A 
continuous-time  Markov  chain  (('  I  NK!)  model  of  s\  stems  is  then  introduced  and  the  associated 
measures  that  arc  of  primary  interest  in  evaluating  highly  available  systems  arc  defined. 


2.1  Review  of  Importance  Sampling 

The  basic  notion  behind  importance  sampling  can  be  illustrated  using  a  simple  example:  estimating 
the  expected  value  of  a  function  of  a  random  variable  (sec,  e.g.,  [20]).  Suppose  that  0  is  an  input 
parameter  to  the  simulation,  e.g.,  a  failure  rate.  Associated  with  each  0  is  a  probability  density 
function  (pdf)  ffl(x)  for  —  o->  <  ,v  <  on.  Suppose  we  wish  to  estimate  r{0)  =  1  ;„[/i(.Y )]  for  some 
function  h  where  the  subscript  0  indicates  that  the  random  variable  (rv)  X  is  sampled  from  the  pdf 
ffl(.x).  I  hen 


poo 

=  l^[AW]  =  j  h(x)f0(x)dx. 


Now  assuming  that  Pop{x)  is  another  probability  density  function  such  that  Pon(x)  >  0  for  all  ac. 
liquation  2. 1  can  be  written  as 


-  />( 


Ponix)dx  =  I  y)n[/t(  V)  I-(0,  fl„.  ,V)]. 


where  L(0,fln,x)  =fo(x)IPo^x)  is  called  the  likelihood  ratio.  Hquation  2.2  thus  provides  a  means 
to  produce  an  unbiased  estimate  of  r\0)  by  simulation  using  the  different  probability  density  func¬ 
tion  Pop(x).  'This  sw  itch  of  the  probability  density  function  is  called  a  change  of  measure;  the  re¬ 
sulting  simulation  algorithm  is  called  importance  sampling. 

1  or  our  purposes,  the  goal  of  this  change  of  measure  is  to  produce  an  estimate  with  lower  variance. 
In  fact,  if  h(x)  >  0  for  all  ec,  then  choosing  p0p(x)  =  h(x)f0(x)lr(0)  yields  a  zero  variance  estimator, 
since  in  this  case  the  rv  h(X)  (?n,  X)  takes  on  the  constant  value  t\0)  with  probability  one.  In 
practice,  however,  this  is  not  a  feasible  change  of  measure  since  it  requires  knowing  r{0),  the  un¬ 
known  measure  to  be  estimated.  Nevertheless,  we  will  find  the  zero  variance  transformation  useful, 
since  it  can  be  calculated  for  some  simple  examples  and  forms  the  basis  of  a  heuristic  for  simulating 
more  complex  examples  of  highly  available  systems. 

2  flackground  and  Notation  S 


We  next  specialize  these  results  to  rare  event  simulation,  anil  show  why  importance  sampling  can 
he  particularly  effective  lor  estimating  the  probability  of  rare  events  (this  argument  is  basically  taken 
from  [44]).  Suppose  that  /i(.v)  =  £(.v)  x  !(,,=  /]  for  some  function  g  where  l(,€/j is  the  indicator  of 
a  set  / ,  i.e  .  1  j  xe/ j  =  1  if  xr  /  and  I  ( xe/  ,  =  0  if  v<//.  We  assume  that  /  is  a  so-called  rare  set  in  the 
sense  that  the  probability  that  the  rv  Ae/'is  close  to  0.  A  single  sample  using  standard  simulation, 
i.e..  sampling  A  from  the  pdf  Jn{x)  and  fonning  the  estimate  ^(A)  x  |,Ag,.,  has  variance 
1  '((>)  =  l  ,U(AT  1 1 ,ve / } ]  —  rt//)2  (note  that  ljVe/.-)  =  ({av/)1-  O'1  °lhcr  hand,  the  variance  ob¬ 
tained  using  importance  sampling,  i.e.,  sampling  ,Y  from  the  pdf  /ifl  (.v)  and  forming  the  estimate 
g(.V)  x  I {ae/ }  x  1.(0,  0().  A‘)  is  I '(/)„)  =  h^jefY)2  I  jVe/  )  1.(0,  fln.  A-)2]  -  r(//)2.  Notice  that  if  the  like¬ 
lihood  ratio  l.{0.  0(),  .rK  1  for  all  xel\  then 


iw  «  l,.,w 0„.  A)]  -r(0)2 

=  ly,D?(Y)2  l{.Vef)]  -  r((/)2  -  !•(«) 


(2.3) 


where  the  first  equality  is  true  by  I  quation  2.2.  Thus  to  achieve  variance  reduction,  wc  need  to 
make  the  likelihood  ratio  very  small  on  the  rare  set  Since  /  is  a  rare  set,  fn{x)  is  typically  small 
for  .vs /’and  since  1.(0, 0n,  v)  =  fo(x)IPfl  (x),  we  obtain  a  small  likelihood  ratio  by  choosing  the  pa¬ 
rameter  00  so  that  Pnn(x)  is  relatively  large  for  xeE.  i.e.,  by  making  the  rare  set  more  likely  to  occur 
under  the  new  measure  defined  by  Pon(x).  This  intuitively  explains  the  basic  notion  behind  using 
importance  sampling  in  rare  event  simulation. 

2.2  Markov  Chains  and  Associated  Dependability  Measures 

We  assume  that  the  system  can  be  represented  by  a  CTMC  V  =  (  )  t,  s  >  0  }  with  finite  state  space 
E  and  infinitesimal  generator  matrix  Q  =  f  q( ij) ,  i,  j  e  E  }.  We  let  q(i)  =  —  q(i,i)  denote  the  total  rate 
out  of  state  /  (see,  e  g.,  [23]).  We  further  assume  that  E  can  be  partitioned  into  two  subsets: 
A  =  O  U  (  w  here  O  is  the  set  of  up  states,  i.e.,  the  set  of  states  for  which  the  system  is  operational, 
and  /  is  the  set  of  down,  or  failed  states.  We  assume  that  the  system  starts  out  in  the  state  for  which 
all  components  are  operational;  wc  label  this  state  as  state  0.  1'or  any  set  of  states  A,  let  x A  denote 
the  time  the  Markov  chain  first  enters  the  set  A.  Of  particular  interest  arc  i*0,  which  is  the  first  re¬ 
turn  tirr.  -  to  state  0,  and  xr,  which  is  the  first  entrance  time  into  the  subset  /  of  failed  sates. 

We  will  he  interested  in  two  types  of  dependability  measures  associated  with  the  CTMC  Y:  tran¬ 
sient  measures  and  so-called  steady-state  measures.  Considering  the  transient  measures  first,  the 
interval  availability,  A(t).  is  defined  by 
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2  Background  and  Notation 


./</)  =  ~  ['  I  (2.4) 

•'.v=0 

I  his  is  the  fraction  of  time  that  the  system  is  operational  in  the  time  interval  ((),/).  We  let 

l(D  =  I’T/fU)]  (2.5) 

he  the  expected  interval  availability  and  let 

I  (t, x)  =  Y{A(l)  <  -v  ]  (2.6) 

denote  the  distribution  of  availability.  I  he  reliability  of  the  system  is  defined  to  be  the  probability 
that  the  system  does  not  fail  in  the  interval  (()./): 

m  =  n«r>']  =  i  [i (2.7) 

I  or  steady-state  measures  we  assume  that  Y  is  irreducible,  in  which  case  )\=*  )  as  s  -*  oo,  where  => 
denotes  convergence  in  distribution  and  Y  is  a  rv  having  the  steady-state  distribution 
n  =  {i Tj .  ieF  }  (rr  solves  the  equations  nQ  —  0).  Notice  that  steady-state  measures  are  independent 
of  the  starting  state  of  the  system,  however,  we  will  choose  the  fully  operational  state  (i.c.,  state  0) 
to  define  a  regenerative  state  for  the  system.  By  regenerative  process  theory  (see  [43]  or  [KJ). 
steady-state  measures  take  the  form  of  a  ratio  of  two  expected  values: 


r  =  r.[/0)J  =  lim  I  •[/[)',)] 

/-♦DO 


f^n 

H  /?>■,)*] 

Js=0 _ 

•;[«n] 


(2.8) 


where  /is  a  real  valued  function  on  F..  If  f[i)  =  1  j,-6o)-  then  !'[/( )t]  is  the  long  run  fraction  of  time 
the  system  is  operational  and  is  called  the  steady-state  availability,  which  we  denote  by 
A  =  lim  I  [/!(/)].  We  will  sometimes  find  it  convenient  to  consider  the  expected  unavailability 
/’(/)=  1  —/(/)=  1  —  F[/f(0]  and  the  steady-state  unavailability,  (  =  1  —  A ,  The  problem  of 
steady-state  estimation  thus  reduces  to  one  of  estimating  the  ratio  of  two  expected  values. 

The  mean  time  to  failure  (MTTI:),  F;[<x/r],  is  typically  thought  of  as  a  transient  measure,  since  it 
depends  on  the  starting  state  of  the  system  (state  0)  which  is  assumed  to  be  the  fully  operational 
state  A  ratio  representation  for  ITstr]  is  found  to  be  particularly  useful.  To  derive  this  ratio,  we 
write 
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Now,  applying  the  Markov  property  at  time  rn  shows  that,  on  the  set  { ■»,,  <  (or /  —  •»,,)  is  con¬ 
ditionally  imlependent  of  l{,n<,  j  and  furthermore  has  the  same  distribution  as  o/..  Therefore, 
taking  expected  values  of  I  quation  2.9  and  rearranging  terms  yields  the  ratio  formula 
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I T  mill  (or,  .  or,,)  ] 


V 
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(2.1(1) 


Thus  we  can  view  estimating  IT’rl  as  a  ratio  estimation  problem,  where  bofli  the  numerator  and 
the  denominator  arc  estimated  using  a  regenerative  simulation.  Therefore,  in  Section  2  we  consider 
the  estimation  of  the  mean  time  to  failure  (M  i  l  l  )  together  with  steady-state  measures  which  arc 
also  (and  more  commonly)  estimated  using  regenerative  simulations. 
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3.  Estimating  Steady-State  Measures 


In  this  section,  we  discuss  the  estimation  of  steads -state  measures  of  (  i  \K's  We  begin  In  re¬ 
viewing  how  this  problem  can  he  reduced  to  estimating  associated  steads -state  measures  in 
discrete-time  Markov  chains  (Dl  MCsl  and  describe  the  regenerative  method  of  simulation  (see. 
eg.  [S]t.  We  next  describe  the  application  of  importance  sampling  to  DlMC’s.  In  particular,  we 
note  that  the  importance  sampling  transformation  selected  for  actual  simulation  can  be  dynamic  in 
the  sense  that  it  need  not  correspond  directly  to  a  time  homogeneous  l)T\K'.  We  also  note  that, 
since  the  problem  is  one  of  estimating  the  ratio  of  two  expected  values,  there  is  no  need  to  use  the 
same  importance  sampling  transformation  for  estimating  both  the  numerator  and  denominator  of 
this  ratio,  i.e..  the  importance  sampling  transformations  can  be  im  asiire  specific  Anaiv  sis  of  a  three 
state  example  emphasizes  the  benefit  of  both  dynamic  and  measure  specific  importance  sampling 
and  serves  as  the  basis  for  heuristics  for  larger,  more  complex  system  availability  models.  I  he  op¬ 
timal  allocation  of  C'l’l  time  to  estimation  of  the  nutneratot  and  denominator  is  then  diseased. 
I  ’he  section  concludes  by  considering  asymptotic  bias  expansions  of  the  estimators. 


3.1  Discrete  l  ime  Conversion  of  CTMCs 

In  [21]  it  is  shown  how  one  can  estimate  steady-state  measures  of  an  irreducible  ('  I  NK'  by  simu¬ 
lating  only  the  embedded  I)T\1C  (and  not  generating  random  holding  times).  let 
X  =  {,V„,  n  >  0|  denote  the  embedded  OTMC  of  the  (  T\1C  V:  X  has  transition  matrix 
I*  =  {/>(/,/)}  where  p(i.i)  —  0  and  />(/,/)  =  <?( /,/)/</( f)  for  j  =£  i  I  et  h(i)  —  I  l<j{i)  be  the  mean  bolding  time 
in  state  i  anil  let  g( t)  =  f\i)jq(t)  let  r  ,  be  the  first  entrance  time  of  the  DTMC  into  the  set  f  and 
let  Tf>  he  the  first  return  time  of  the  |VI  \K'  to  state  0.  I  hen 

r[P"/n>/v] 

I  L/f  > )]  =  lim  !  [/[>  ,)]  =  j -  -  -  (-VI) 

-  I  ■['»n]  '"_l 

I  [  £  AC  A  V>] 

t-^n 


lo  emphasize  the  dependence  of  this  ratio  on  the  transition  matrix  I’,  we  write  liquation  .VI  cx- 

Tn  — '  rn  -  I 

plieitly  as  r=  l  ,.[ (r]/f ,»[//]  where  G  =  £  g(Aft)  and  II  =  V  /t(.Vfr).  In  [21],  it  is  shown  that  this 

*•=0  k-n 

discrete  time  conversion  is  always  guaranteed  to  produce  a  variance  reduction  over  simulation  of 
the  original  CTMC  fox  and  Cilynn  [10]  have  extended  this  result  to  simulation  of  semi-Markov 
processes 
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1  ip  u.»>n  3  I  forms  the  basis  lor  the  regenerative  method  of  simulation  lor  (  IMC's  (see.  eg. 
f '■s ] )  One  simulates  (using  the  transition  matrix  I*)  m  (sax  )  independent  and  identical!)  distributed 
did)  replicates  of  the  random  vector  ( C » . / / ) .  yielding  the  iid  random  vectors  \{(ir  //,) :/  —  I.  ,m  }. 
I  ach  re|  lication  imolves  simulating  the  PiMC  \  (with  the  initial  condition  A..  —  II  )  to  time  r, , : 
these  replications  are  known,  in  the  simulation  literature,  as  regenerative  cycles  I  ct 

«i 

^ I* )  —  v{;.;V//:.  1  hen.  because  the  cxcles  are  iid.  lim  /•  ( I*)  =  r  with  probabilitv  one  and 

;!T? 

m\>„,  (1*1  V lit.  ct^(  I* )  / 1  pT  //]).  where  V(0,  <t~)  denotes  a  normally  distributed  random  \ari- 

able  \s  i  1  h  mean  zero  and  variance  a’ ,  and  n  'll’)  =  Var,.[f»;  -  rl/j], 

3.2  Importance  Sampling  for  DTMCs 

We  next  extend  the  change  of  measure  transformation  of  Iqnation  2  1  to  DI  NK's  I  et  r  be  any 
stopping  tune  of  the  1)1  \1C  \  ami  let  )  be  a  rv  defined  on  X  up  until  time  t.  Informally  ,  t  is  a 
stopping  time  if  the  event  { r  =  n)  is  determined  by  \(  =  (  VM,  .  A,,)  1  he  r\  )  is  then  a  (measur¬ 
able)  function  ol  \.  — .  j  \’, . \.)  (see  [23]  for  a  more  detailed  and  precise  treatment  of  stopping 

times)  The  first  entrance  time  to  a  state,  or  a  set  of  states,  is  a  stopping  time.  In  particular,  both 
r.  and  r i  arc  stopping  times.  I  ct  l„  denote  the  set  of  all  possible  sample  paths  up  until  time  n.  i.c  . 
.  f,,  =  fs,.  =  ( v . e  /:'}.  lor  any  s,,  e  /•!„.  let 

l’<V  =  P(  s0)  />(?„.  r,  ,  ..rP...  />(<„_,.  x„)  (A.2) 

w  here  />(  t„)  is  the  probability  that  the  initial  state  is  rn  I  et  r  f„  be  the  set  of  sample  paths  for 
which  -r  —  n. 

Proposition  3.1 

I  et  r  be  a  stopping  time  which,  under  the  transition  matrix  P.  is  finite  with  probability  one  and  let 
/  be  a  (measurable)  function  of  X.  for  which  I  ,,[  I  /(X .)  I  ]  <  I  et  P'  be  any  other  measure  such 
that,  under  P'.  t  is  finite  with  probability  one  and  for  any  s„  •?  /?„.  P'(s„)  *  0  whenever 
/|s„)P(s„)  *  t)  Then 

lr[/(X.)J  -  lr[/fX.)/.',(XJl  (3.3) 

where  for  any  n.  /.' ,(X„>  -  P(X„)/P'(X„) 

Proof:  Since,  under  P.  r  is  finite  with  probability  one  and  since  /  has  a  finite  absolute  first  mo¬ 
ment.  we  can  write 
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in 


r,,[/(\jj  =  V  V  i*,s„)/(s„)  =  V  V  |>'(s>j) /(s>i, //l(s>i)  =  ir!/(X.)/',(Xr)]  (3.4) 

/i-u  s.,  e  H..  n— 0  s„  e  /?„ 

where  t he  last  equality  follows  since  r  is  finite  with  probability  one  under  P'.C] 

Versions  of  this  proposition  have  appeared  elsewhere,  c.g..  in  [44],  [3K]  or  [14],  Note  that  there 
can  exist  a  sample  path  s„  such  that  P(s„)  >  0  even  though  P'(s„)  =  0.  provided  that  X(s„)  =  0.  We 
emphasize  however  that  the  measure  I*'  does  not  have  to  correspond  to  a  time  homogeneous 
Markov  chain,  nor  even  that  it  corresponds  to  a  Markov  chain.  Indeed  we  will  see  that  it  is  highly 
advantageous  in  many  circumstances  for  P'  not  to  be  Markovian.  I  he  general  form  that  we  will 
consider  for  I*'  is 


P'(s,,)  =  P'(rn)  P’ffi  I -vf))  P'Di!  r().  h)  •  lv(v„|.Tn . v„_|).  (3.5) 

With  this  formulation,  we  have  the  freedom  to.  e  g.,  adjust  the  transition  probabilities  to  depend 
upon  the  number  ot  visits  the  chain  has  made  to  a  set  of  states  (say  the  failed  states)  or  simply  to 
turn  off"  the  importance  sampling  whenever  the  likelihood  ratio  gets  too  small,  thereby  avoiding 
numerical  problems.  Wc  term  the  use  of  such  an  importance  sampling  distribution  Dynamic  Im¬ 
portance  Sampling  (DIS). 

Applying  DIS  to  estimating  the  ratio  of  Equation  3.1  yields  the  following  procedure.  A  total  of  m 
iid  regenerative  cycles  of  the  DTMC  X  arc  simulated  using  the  DIS  distribution  I*'.  I  ct 
(>r  //,  and  l.'fi  be  the  samples  of  (7,  II  and  respectively,  from  cycle  j.  Define  the  point  estimate 

m  m 

r„,(P')  =  Y  f';  l-'\i  I Y.  llj  l-  \j-  Then,  as  in  the  case  without  importance  sampling,  we  have 

f  /=  i  ,/=  i  _  , 

lim  rm(P')  =  r  with  probability  one  and  Jm  (rm( P'  )  —  r)=>.V((),  «t‘(P')/I'.,»[//;]2)  where 

IT\  —*ry~  "  ■' 

a2(P')  =  Var,v[((7.  -  r //:)//,.] 

(3.6) 

=  Varp-t^- -  2  K  ovp, +  r2  Var,, ■[//,- 

Now  from  the  form  of  n  (!*'),  it  is  seen  that  selecting  a  good  DIS  distribution  P'  involves  taking 
three  terms  into  consideration.  I  or  example,  selecting  a  !*'  to  reduce  the  variance  of  the  estimate 
of  the  ratio's  numerator  may  actually  increase  the  variance  of  the  estimate  of  the  denominator,  or 
vice  versa.  In  addition,  the  effect  on  the  covariance  term  will  generally  be  difficult  to  control,  or 
even  predict  Thus  selection  of  a  single  importance  sampling  distribution  for  both  the  numerator 
and  denominator  involves  a  trade-off 

This  suggests  that,  since  wc  arc  really  trying  to  estimate  two  different  quantities,  we  should  use 
different  changes  of  measures  to  estimate  each  quantity  I  stimating  the  numerator  and  denomi- 

!  I 
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nator  independently  allows  one  to  tailor  the  importance  sampling  distributions  to  the  particular 
measure  being  estimated,  without  having  to  be  concerned  about  the  covariance  term  We  call  this 
Measure  Specific  Dynamic  Importance  Sampling  (MSDIS).  Section  3.3  provides  further  moti¬ 
vation  for  the  use  of  D1S.  as  well  as  MSDIS.  In  fact  for  the  example  given,  the  two  optimal,  i . e . , 
7ero  variance,  changes  of  measure  are  opposites  in  the  sense  that  numerator's  optimal  change  of 
measure  brings  the  system  very  quickly  to  the  failed  state,  whereas  the  denominator’s  optimal 
change  of  measure  is  approximately  the  same  as  the  original  measure  and  thus  brings  the  system 
only  very  slowly  to  the  failed  state. 

1  he  procedure  for  MSDIS  can  he  described  more  completely  as  follow  s.  I  et  I’’  and  I’"  denote  the 
DIS  distributions  for  the  numerator  and  denominator,  respectively  A  total  of  m  cycles  are  simu¬ 
lated  Assume  that  pm  cycles  of  the  numerator  are  simulated  and  (I  -  P)m  cycles  of  the  denomi¬ 
nator  are  independently  simulated  where  (></?<  I  (for  notational  simplicity,  assume  that  fim  is  an 
integer)  Define 


pm 

YGjl.'unpm) 
J=  I 


(1  -P)m 

X  //;  /•")//([  1  -  n>") 

7=1 


(3.7) 


Note  that  is  actually  independent  of  //y  / even  though  they  have  the  same  subscript. 

Then.  as  before,  lim  ?_,(P\  P")  =  r  with  probability  one  and 

m-»o^ 

>  m  (/•„,(  P\  P")  -  ;•)=>, V(0,  <t2(P',  P")/I:, ,[//,]“)  where 


rr3(P',  P")  = 


Varr[^/.',y]  2  Var,, ..[/// /•")/] 


-f  / 


0-P) 


(3.81 


The  optimal  run  length  allocation  between  the  numerator  and  the  denominator  will  be  considered 
in  Section  3,4. 


T3  A  Three  State  Example 

n  this  section,  we  consider  a  simple  availability  example,  namely  a  three  state  Birth  and  Death 
■  ss  (see  [23]).  Because  of  its  simple  structure,  the  optimal  zero  variance  importance  sampling 
distributions  can  be  derived  in  closed  form.  The  optimal  changes  of  measure  for  the  numerator  and 
denominator  are  quite  different.  These  results  would  be  of  no  significance  except  that  the  thmc  state 
example  serves  as  a  paradigm  for  more  complex  models  and  thus  strongly  suggests  a  basic  form  for 
effective  importance  sampling  distributions  in  more  complex  availability  models 
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The  state  space  is  /  —  M).  I.  21.  the  birth  rates  are  ).r  i  =  0.  I  and  the  death  rates  arc  /=  1.2.  In 
the  reliability  context,  this  models  a  system  with  two  identical  components  which  can  fail  and  he 
repaired  We  assume  that  births  correspond  to  failures  and  deaths  correspond  to  repairs  so  that 
state  i  corresponds  to  having  i  failed  components.  We  consider  the  system  to  be  operational  in 
states  0  and  I.  but  failed  in  state  2. 

The  embedded  DI'\1C  lias  the  following  non-zero  entries:  />((),  1  )  =  p(2, 1  )=  1 , 

p{  1,2)  =  r  =  z  |  /(z,  4  //|)  and  />(  1.0)  =  ( I  —  r).  1  etting  /i,  denote  the  mean  holding  time  in  state  i, 
then  1^  =  |/;„,  /i,  =  l/(/,  4-  /v , )  and  h2  =  l//i2.  We  assume  that  failure  rates  are  much  less  than  re¬ 
pair  rates,  specifically  we  assume  that  /i0  =  (-)(l/r).  /j ,  —  (-)( 1 )  and  h1  —  M(  I )  (we  follow  Knuth  s 
[24]  usage  of  fix)  =  (-)(g(.r))  if  there  exist  constants  ( and  (  '■>  such  that 

V.v,  0  <  <  |.g(.v)  <  fix)  <  ( '2g(.v)  I. 

The  steady-state  measure  r  <if  interest  is  the  stationary  probability  of  being  in  state  2.  the  steady- 
state  unavailability.  This  can  be  estimated  using  regenerative  simulation  with  function  values 
,?(()).  ,e(  1)  and  ,g(2)  equal  to  0.  0  and  lh.  respectively,  and  function  values  /it 0).  /it  I )  and  /i( 2)  equal 
to  V  /t,  and  /t2.  respectively.  Assume  state  0  is  the  regenerative  state.  We  first  compute  the  variance 
of  the  estimator  using  standard  regenerative  simulation  I  ct  n t  be  the  number  of  visits  to  the  failed 
state,  state  2,  during  a  regenerative  cycle  and  let  s(-  denote  the  (unique)  sample  path  of  a  regeneration 
cycle  of  the  !)T\1C  for  which  nf  ~  i.  Then  G  =  nrh2  and  II  =  /t,,  4-  h]  4-  nf.{h]  4-  h2)-  I  urthermorc, 

n2  has  a  geometric  distribution,  !’{«/  =  /'!  =  (1  —  r.)r  for  i>  0.  so  that  I  (.[«/ ]  =  r./(  1  -  r)  and 

Var|,[/Ir]  —  r./(  1  -  r.)2.  t  hus  l;(,[<7]  = /l2r/(l  -  r)  and  !',,[//]  =  (/i,,  4  /t, )  4  (/q  4- /?2)r/()  -  r). 
Straightforward  calculation:  show  that  r  —  C-)(r2)  and  that  the  asymptotic  squared  coefficient  of 
variation  of  rm (F)  (obtained  from  the  central  limit  theorem)  is 


Var(>[<7  —  rlf\ 
mr 2  I  f.[//]2 


f-)(  l/mt). 


(-TQ) 


The  dominant  term  in  liquation  3.9  is  due  to  contribution  of  the  numerator,  d  ims  to  obtain  a 
confidence  interval  with  a  relative  width  (width  divided  by  the  point  estimate)  that  goes  to  zero 
requires  that  the  sample  size  m  be  large  cnougli  so  that  mr.  — »  00.  This  demonstrates  the  potentially 
large  sample  size  required  for  rare  event  simulations  (in  which  r.stO). 

I  ct  P(s()  be  the  probability  of  a  regeneration  sample  path  s,  .  then  l*(s,)  -  ( I  -  r)r‘.  i>  0. 

dhc  optimal  zero  variance  importance  sampling  distribution  lv(s;),  />(),  for  estimating  !',.[//]  is 

computed  from  explicit  enumeration  of  all  sample  paths  first,  we  write 

I  |.[f»']  =  Vfris^I’fs,)  =  V[(7(s,)/.(s()]P'(s ,)  =  1  ,,,[/;/.]  .  with  /.(s()  =  P(s,)/P'(s;)  .  Now,  the 
1i  Vi 

optimal  P'(sf).  for  all  /  >  0.  can  be  computed  (similar  to  what  is  described  in  Section  2.1 ): 

1 1 
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(-V10) 


P(*/)«(s,) 


( 1  —  r)2  r  '  /  >  0, 


since  then  (»(s,-)/.(s ,),  for  all  />  0.  is  a  constant  equal  to  I'|»[<>] 

Similarly,  the  optimal  zero  variance  importance  sampling  distribution  for  estimating  Ep[//]  is  given 

by 


Pfs,)!  I  (s,-) 


[  (Ap  +  /'ll  +  0l\  +  h 2)  f  ]  (1  —  c)2  r* 

An  +  A,  -  r.  ( /i,,  -  /i2 ) 


(3.1 1) 


Erom  liquation  3.10,  P'  (s,,)  =  0,  P'  (s,)  =  (I  -  r)2,  P'*(s2)  =  2r(l  -  r.)2  and  so  on.  Now  let 

* 

p'  (1.0|r72  =  i)  denote  the  probability  of  going  from  state  1  to  state  0  given  that  the  chain  is  in  state 
I  anil  that  the  failed  state  has  already  been  visited  i  times.  Then 

p'  ( 1,01:?/  =  /)  -  P'  (Sj)Z(VP'  (S/))  and  thus,  from  liquation  3.10,  p'  (|,()|r?/.  -  ()(  =  0  and 

J>i 


p'  ( 1 .0|«/.-  =  i)  = 


(1  -c)2 

r.  (  /  - 


Therefore,  each  successive  time  the  simulation  enters  state  1,  the  probability  of  returning  to  state 
0  changes  (under  both  P'  (s)  and  P"  (s)).  Thus  the  optimal  changes  of  measure  for  both  the  nu¬ 
merator  and  the  denominator  of  liquation  2.1  are  dynamic.  In  particular  note  that  while 

/?'*(!. 0|/t/..  =  0)  =  0,  />'*(1.0|nf=  1)  =  (I -c)2*(l -2e)«(1  -r.)  =  /?(l,0)  for  r.«0.  Also. 

♦ 

liin/?'  ( 1 ,0|/J/;  =  i)  =  (1  —  e)  =  /?(1,0).  This  suggests  that,  for  more  complex  models,  the  importance 

f— *00 

sampling  distribution  for  the  numerator  should  be  chosen  to  move  the  system  very  quickly  to  the 
set  of  failed  states  /’,  but  that  once  F  is  entered,  the  importance  sampling  should  be  turned  off  so 
that  the  system  quickly  returns  to  state  0.  This  should  hold  true  for  systems  in  which  the  proba¬ 
bility  of  two  or  more  failures  in  a  regenerative  cycle  is  at  least  an  order  of  magnitude  less  likely  than 
the  probability  of  one  failure  in  a  cycle.  This  is  also  consistent  with  the  argument  given  in  Section 
2.1  as  well  as  Walrand's  suggestion  in  [44,  38]  (which  was  derived  using  large  deviation  results)  to 
interchange  ).  and  //  for  estimating  the  probability  of  buffer  overflow  in  the  M/M/1  queue. 

Tor  the  denominator,  on  the  other  hand,  the  largest  contribution  to  the  expected  value  comes  from 
the  sample  path  on  which  nr  =  0.  which  is  not  a  rare  event.  This  suggests  using  standard  simu¬ 
lation,  i.e.,  not  using  importance  sampling,  to  estimate  the  denominator.  Indeed,  the  optimal 
change  of  measure  of  Equation  3.1 1  has 


P"  (so)  =  /?"  (l.0|/tr  =  0)  = 


( 1  -  o2 

k  (  An  -  A2  )  /  (  A,,  +  /i, 


(l-r,)  =  />(!,()) 
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so  that  there  is  very  little  difference  between  I’"  (s)  and  l*(s)  for  the  most  likely  sample  path. 


3.4  Optimal  Run  Length  Allocation 

I  quation  .VS  gave  the  form  of  the  asymptotic  variance  when  a  fixed  number  m  of  cycles  arc  simu¬ 
lated  of  which  pm  are  devoted  to  simulation  of  the  numerator  and  (I  —  P)m  are  allocated  to  the 
denominator.  Since  the  expected  amount  of  CI’l'  time  to  simulate  a  sample  of  the  numerator  and 
denominator  may  be  different,  a  more  practical  run  length  allocation  model  can  be  formulated  as 
follows  I  ct  the  total  ('1*1'  time  be  I  and  assume  that  /?'/'  is  allocated  to  the  numerator  and 
( 1  -  /?)  /  is  allocated  to  the  denominator.  1  ct  rn  (rd)  denote  the  expected  CPI  time  to  simulate  a 
sample  of  the  numerator  (denominator).  Then,  for  large  7,  approximately  /?7'/r„  cycles  of  the  nu¬ 
merator  and  (1  —  /?)'/  / cycles  of  the  denominator  arc  obtained.  The  asymptotic  variance  of  the 
resulting  point  estimate  is 
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/  I  pC///]2 


P 


+ 


°2drd  \ 

) 


(3.14) 
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where  o„  =  Var, ..[<7/  /.'t/]  and  nd  =  Varp..[//y  This  result  is  obtained  by  applying  results  from 
renewal  theory  (sec  [4.3]).  Minimization  of  3.14  with  respect  to  /?  yields  /?  =  <$/(  1  -f  <))  where 

=  nn\  cn  Kr(,d\lcd  )■ 

Suppose  that  cn^cd  and  that  onvnd  (we  are  equally  effective  in  reducing  the  variance  of  the  nu¬ 
merator  and  denominator).  Then,  for  estimating  the  steady-state  unavailability,  r  is  small  and 

* 

P  v  1.  i.e..  the  bulk  of  the  effort  should  be  applied  to  estimating  the  numerator,  which  in  this  ease 

is  a  rare  event  simulation  using  importance  sampling.  On  the  other  hand,  for  estimating  the  MTIT 

+ 

using  the  ratio  formula  given  in  liquation  2.10,  r  is  large  and  p  ^0,  i.e.,  the  bulk  of  the  effort  is 
devoted  to  the  denominator.  However,  for  the  MTIT,  the  denominator  also  corresponds  to  a  rare 
event  simulation  using  importance  sampling  (moreover,  as  will  be  discussed  in  Section  5,  the  same 
importance  sampling  distribution  can  be  used  to  estimate  both  measures).  Thus  in  either  ease,  the 
optimal  allocations  are  consistent  in  the  sense  that  they  allocate  most  of  the  effort  to  rare  event 
simulation. 

In  practice,  we  always  devote  a  minimum  percentage,  say  10%,  of  the  effort  to  standard  simulation 
even  though  the  optimal  allocation  usually  suggests  devoting  much  less  time  to  standard  simulation. 
Phis  permits  stable  variance  estimation  and  the  loss  in  asymptotic  efficiency  from  the  optimal  al¬ 
location  is  small. 
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3.5  Bias  Expansions 

We  now  consider  bias  expansions  of  ratio  estimators  of  steady-state  measures  Because  the  nu¬ 
merator  and  denominator  are  simulated  independently,  some  specific  conclusions  can  be  drawn 
from  these  expansions.  References  for  this  type  of  bias  expansion  may  be  found  in  Section  27  of 
[7],  Chapter  2  of  [27]  or  [12]  They  arc  derived  using  Taylor  scries  expansions  and  multidimen¬ 
sional  central  limit  theorems.  1  ct  <('„  =  (C„(l) . ('„(</))■  n  >  1 }  be  a  sequence  of  iid  vectors  of 

length  d  and  let  /i  =  (/<, . // d)  where  ![(',,]  =  ft.  Suppose  we  are  interested  in  estimating  g(fi)  for 

some  function  e  In  the  ease  of  ratio  estimation  //  =  (//,.// 2)  and  g(/i)  —  I  ct 

_  >»l 

('„,  =  (!  »?n  v  Cf;  Then,  under  appropriate  technical  conditions  on  £  and  the  moments  of  C„  , 


d  d 

=  .*(/•)+  +  o(l/m)  <-v,5) 
1=1  ,/=l 

-j2 

where  “  Cov[  („(/).  ('„(/)]  and  ^  ^(x)  |  In  our  case,  C„  =  (<7, 

■V  •’9  r 

on  =  Varr i ^].  "22  =  Var,,. and  o,2  =  0  (since  the  numerator  and  denominator  are 
simulated  independently  ).  Note  that  in  the  above  vve  assume  for  simplicity  that  m  cycles  of  both 
the  numerator  and  denominator  are  simulated.  Differentiation  of  #  yields  £n=0, 

£22  =  2// 1  / // 2  =  2 rift\  and  £)2  =  -  1  ///j-  Since  o12  =  0.  the  value  of  gn  does  not  enter  into  the 
MSDIS  bias  expansion.  Therefore 


?(/•)  + 


rVarr.  [//„/.",„] 


2 

m  ,,2 


+  o(  1  /m) . 


(2.1b) 


l  or  the  measures  of  interest  in  availability  modeling,  r  >  0,  //„  >  0  and  //2  >  0  so  that, 
asymptotically,  I;[g((.m)]  >  g(/i).  Furthermore,  this  asymptotic  bias  expansion  is  independent  of  the 
importance  sampling  distribution  I”  chosen  for  simulation  of  the  numerator. 

lor  the  steady-state  unavailability  we  select  I’"  =  I*.  By  the  results  of  Section  2.2.  in  the  three  state 
example  r  =  (r)(r2),  Var(,[//„]  =  0(r.)  and  //2  =  FP[//„]  =  0(!/s).  Therefore,  the  leading  term  in  the 
bias  expansion  is  of  order  r^jm  (relative  bias  is  of  order  r.?/n?  )  which  is  typically  quite  small.  With 
standard  simulation  the  bias  expansion,  which  now  includes  the  effect  of  correlation  between  the 
numerator  and  denominator,  becomes 
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//«)] 


£(/')  + 
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+  o(l  /m)  . 


(2.17) 
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For  the  three  state  example.  Cov,,[(7,r  //„]  =  4  AjjVarpjn,.]  =  C-)(r.)  so  the  dominant  term  in 

the  bias  expansion  is  ( ■0(r.1/m )  which  is  significantly  larger  than  the  bias  obtained  using 

MSD1S.  Moreover,  using  standard  simulation,  the  relative  bias  (bias  /)  is  only  0(r.//n).  These  ob¬ 
servations  are  consistent  with  the  experimental  results  described  in  Section  6. 

For  the  Ml  II.  r  is  large  and  =  Pj'*/  <  ’"*0)  ’s  small,  which  potentially  makes  the  leading  bias 
term  large.  However,  as  seen  from  I  quation  .116,  choosing  an  importance  sampling  distribution 
P"  for  the  purpose  of  reducing  variance  also  has  the  beneficial  effect  of  reducing  bias. 

In  practice,  m  may  have  to  be  very  large  in  order  for  these  asymptotic  expansions  to  be  valid.  In 
particular,  for  small  values  of  m  the  higher  order  terms  may  contribute  in  a  non-ncgligible  way  so 
that,  e  g..  F[, £(('„,)]  <  £(/«)■  If  bias  turns  out  to  be  of  significant  concern,  then  a  bias  reducing 
technique  such  as  jackknifing  may  be  used  to  remove  the  leading  term  of  order  I  jm  in  the  bias  ex¬ 
pansion  (see  [!']) 
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4.  Estimating  Transient  Measures 


Simulation  of  the  CT\1C  Y  consists  of  two  parts:  simulating  the  sequence  of  states  visited  by  the 
embedded  DTMC  Y  with  transition  matrix  1*,  and  simulating  the  holding  times  in  each  of  the 
states  We  let  denote  the  holding  time  in  state  A).  Thus  given  that  A \—j,  tj  has  an  exponential 
distribution  with  mean  1  fqij)  and  the  (conditional)  likelihood  of  it  is  simply  q{i)c~g^^r'.  We  let 

t,(  =  (t0 . tn)  denote  the  first  n  4-  1  holding  times  of  the  (TMC.  (iiven  that  X,  =  (A'„ . V„),  the 

likelihood  of  t„  is  therefore 

f(«„IX„>  =  tfA',,)  r"*(-Vn),n  ...  q(X„)  c-*™"  (4.1) 

and  thus  the  likelihood  of  the  sample  path  (X„.  t„)  is 

Q(X„,t„)  =  P(X„)f(t„|X„)  (4.2) 

where  P(X„)  was  defined  in  liquation  3.2.  liquation  4.2  gives  the  likelihood  at  the  times  of  the 
jumps  of  the  embedded  DTMC. 

We  basically  adapt  the  development  in  [14]  in  order  to  extend  Proposition  3.1.  Define  7'()  =  0  and 

7„  =  l0  4-  for  n  >  1.  Then  Tn  is  the  time  at  which  state  Xn  is  entered,  i.c.,  the  time  of  the 

n  th  transition.  1  et  Y,  =  ()\,  0  <  ,v  <  /).  let  r  be  an  integer  valued  stopping  time  with  respect  to 
the  sequence  of  pairs  {(A'„.  tn),  n  >  0},  i.c.,  the  event  {t  =  «)  can  be  determined  by 
(,Y0,  tn), ...  ,(A’„, /„).  We  let  Q'  denote  another  measure  for  generating  sequences  {(A'„,  /„).  n  >  0). 
We  will  specifically  assume  that  Q'(X,r  t„)  =  P'(X„)  f'(t;i|X„).  With  this  factorization,  the  form  of 
the  contribution  of  the  holding  times  to  the  likelihood,  f  (t„|  X„),  is  almost  arbitrary  (the  restrictions 
will  be  discussed  below),  but  the  sequence  of  states  selected  docs  not  depend  upon  the  holding 
times.  I  et  /?„  be  the  subset  of  the  sample  paths  of  Y  for  which  t  =  n. 

Proposition  4. 1 : 

let  r  be  an  integer  valued  stopping  time  which,  under  Q,  is  finite  with  probability  one.  Define 
rt  =  7  T  and  let  7  be  a  (measurable)  function  of  Ya  for  which  Fy[  1 7(\rl)  |]  <  oo.  I  et  Q'  be  another 
measure  of  the  form  Q'(X„,  t„)  =  P'(X„)  f'(t„  |X„)  such  that,  under  P',  t  is  finite  with  probability- 
one  and  for  any  (s„,  t„)  e  /?„,  P'(s„)  f'(t„|s„)  =£  0  whenever  ZfY^Pfs,,)  f(t„|s„)  -f-  0.  Then 

I:-q[^(YJ]  =  r.y.[Z(Y,)/.',(Xr)/.'2(X..tr)]  (4.3) 

where  for  any  n ,  //2(X„,  t„)  =  f(t„|s„)/f  '(tjs„). 
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The  proof  of  this  Proposition  is  essentially  the  same  as  that  of  Proposition  .VI.  Notice  that  if  the 
stopping  time  r  of  Proposition  4.2  is  r  =  t;  ,  then  the  o<  of  Proposition  4. 1  is  first  time  to  failure,  i.e., 
o<  =  mf .  Measures  defined  over  a  fixed  time  interval  (()./)  (e.g.,  the  expected  interv  al  availability)  are 
handled  in  this  formulation  by  defining  r=A'(/)  +  I  where  ,Y(f)  =  max{/t:7„  <  t).  The  reliability 
R{t)  is  handled  by  setting  r  =  min  (r/.-,  A ■'(/)+  1)  (since,  with  this  definition,  at  time  7\  either  a 
failure  has  occurred  or  simulated  time  has  surpassed  !)  and  setting  /  —  lj7-  <ty 


4.1  Estimating  the  Reliability 

By  Proposition  4.1.  there  are  two  importance  sampling  distributions  to  construct,  corresponding 
to  two  likelihood  ratios.  The  first  distribution  is  for  the  embedded  DTMC  f corresponding  to 
/.' I (XT))  and  the  second  is  for  the  state  holding  times  given  the  DTMCs  sample  path  (corre¬ 
sponding  to  /.'2(Xr,  t.)).  I  ewis  and  Bohtn  [28]  presented  a  technique  for  estimating  the  reliability. 
They  apply  “failure  biasing"  to  the  embedded  DTMC;  this  causes  failures  to  occur  with  higher 
probability  and  therefore  quickly  moves  (biases)  the  DTMC  towards  the  set  ot  failed  states.  They 
also  apply  “forced  transitions”  to  the  holding  time  in  state  0  (the  state  with  .all  components  opera¬ 
tional).  This  forces  the  next  component  failure  to  occur  before  time  l  Specifically,  if  Xn  —  0  and 
T„  <  ?,  then  the  next  holding  time,  /w+t  is  forced  to  be  between  zero  and  I  -  Tn  by  selecting  t„+] 
from  the  conditional  density  given  by 


nWl  l  ^n-  'n 


x„.t„)  = 


X0e~  1 
,  _  c-  W-  K)  • 


r„. 


(4.4) 


where  2.0  is  the  total  failure  rate  in  state  0.  Ihe  simulation  continues  until  time 
r  =  min  (t p,  N(t)  +  1). 

Ross  and  Schcchner  [37]  propose  an  alternative  approach  in  which  some,  or  all,  of  the  holding 
times  arc  conditioned  out.  If  all  holding  times  arc  conditioned  out  then  no  holding  times  arc 
sampled  and  we  set  Z  =  P{7'.^ <  t|XT^}.  Calculation  of  Z  requires  computing  the  convolution  of 
exponentially  distributed  random  variables  with  different  means,  f  or  a  sequence  of  n  states,  this 
can  be  done  in  C-)(n2)  time  using  the  recursions  in  [  39  ].  Using  failure  biasing,  t will  typically  be 
small  so  that  earning  out  this  computation  is,  in  principle,  not  an  obstacle.  However,  an  effective 
and  much  simpler  approach  is  to  only  condition  out  the  total  holding  times  in  state  0  (which  typ¬ 
ically  represents  the  bulk  of  the  time  anyway).  The  embedded  DTMC  is  simulated  until  the  set 
of  failed  states  is  entered,  i.e.,  until  time  rf..  Holding  times  in  the  non-zero  states  arc  randomly 
sampled,  but  no  holding  times  arc  sampled  for  state  0.  I  c(  /?  denote  the  total  holding  time  in  states 
other  than  ():/?=  X  ^  x  If.v,  =*())■  I  «t  denote  the  number  of  visits  to  state  0  and  let  >■  be  a  rv 
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denoting  the  total  holding  time  in  state  0.  Given  y  has  an  I  rlang  distribution  with  shape  pa¬ 
rameter  rif)  and  seale  parameter  20  and  we  write  Ply  <  rl«o}  =  4).  We  then  set 

/  =  P(^  <t|Xv/?}  =  Hy<f- /?!«,,}  =  /•.„„(/-/?.  A0)  .  (4.5) 

l  nlike  Ross  and  Scheehner,  we  apply  the  conditional  Monte  Carlo  approach  in  addition  to  some 
form  of  failure  biasing.  By  the  variance  reducing  property  of  conditional  expectations,  (i.e.,  since 
Var[l  [,Vl  )"]]  <  Var[.V],  see,  e.g.,  page  12  of  [36]),  the  conditional  approach  plus  failure  biasing  is 
always  guaranteed  to  reduce  the  variance  over  just  failure  biasing.  To  see  this,  notice  that 

Var,,.[  //,(XTjf)  iK<I}  ]  >  Varr[  I  [  //,(XT(  )  I  XTf,  /?  ]  ] 

=  Var p.[/-',(Xrr)^(/ -/?.;„)]  ■ 

While  no  such  analytic  result  exists  for  comparing  conditioning  with  forcing,  the  conditioning  ap¬ 
proach  has  several  advantages  over  the  forcing  approach,  hirst,  with  forcing,  different  holding  times 
must  be  generated  for  each  value  of  t  for  which  R(l)  is  to  be  estimated.  Because  of  sampling  errors, 
the  estimates  of  R(t)  may  not  be  monotonic  in  t  Using  the  conditional  approach,  simultaneous, 
monotonic  estimates  of  R(t)  arc  obtained.  Second,  with  forcing,  different  conditional  holding  time 
distributions  arc  used  and  different  likelihood  ratios  must  be  maintained  for  each  value  of  t  for 
which  R(i)  is  to  be  estimated.  This  is  not  necessary  in  the  conditional  approach.  Thus,  it  has 
computational  time  advantage  when  R(t)  is  computed  for  multiple  values  of  t  simultaneously. 

Another  approach  would  be  to  use  the  technique  of  uniformization  (sec  [19]).  A  discussion  of 
approaches  to  using  uniformization  in  simulations,  including  discrete  conversions,  may  be  found 
in  [foxglynn!.].  In  our  context,  failure  rates  arc  much  less  than  repair  rates  and  therefore 
where  I  =  max{<7(/)}  is  the  maximum  state  exit  rate.  The  number  of  transitions  in  the  uniformized 
chain  before  exiting  state  0  (sometimes  called  “pseudo  transitions”)  is  geometrically  distributed  with 
success  parameter  2n/2« 0.  Therefore,  effective  estimation  of  these  rare  event  measures  requires  us¬ 
ing  some  sort  of  importance  sampling  on  the  number  of  stale  0  pseudo  transitions.  This,  in  turn, 
is  very  similar  to  using  forced  transitions. 

4.2  Estimating  the  Expected  Interval  Availability 

In  this  section  wc  present  two  methods  to  estimate  quantities,  such  as  the  expected  interval  avail¬ 
ability,  which  take  the  form  r(l)  =  l:[f‘  }\)</.v].  Wc  assume  that  is  small  so  that  very  fcw: 

failures  are  expected  by  time  t.  The  first  method,  due  to  I.cwis  and  Bbhm  [28],  uses  failure  biasing 
and  forcing  as  described  in  Section  4.1.  ITie  simulation  ends  at  time  7(V^f)+,  =  f0  +  —  +  With 
this  notation,  is  the  holding  time  that  crosses  the  threshold  l.  A  practical  implementation  of 
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this  method  typically  turns  off  the  forcing  after  /.  visits  to  state  0  at  some  value  of  /  for  which 
/.’/(/,/,))  is  extremely  small;  without  this  modification  \'(t)  may  grow  to  he  quite  large  and,  fur¬ 
thermore,  the  simulation  may  generate  extremely  unlikely  sample  paths  having  an  unusually  large 
number  of  v  isits  to  state  0  in  the  interval  (0,/). 

To  apply  the  conditional  Monte  Carlo  approach  to  r(t),  we  begin  with  an  important  result  from 
box  and  Glynn  [11]: 

m- 1 

>V)  -  1:[  ^  X(Xk)  ]  (4.7) 

A-=o 

where  g{i)  =  f[i)lq(i)  and  the  expectation  is  with  respect  to  the  transition  matrix  I*.  Now,  as  sug¬ 
gested  in  [11],  we  could  generate  holding  times  for  the  sole  purpose  of  determining  A'(t),  and  then 

m- 1 

ignore  these  holding  times  by  using  X  £(*V*)  to  estimate  r(t).  However,  we  would  still  have  to 

k=n 

use  conditioning  or  some  sort  of  importance  sampling,  such  as  forcing,  on  the  holding  times  in  state 
0  since  otherwise  /\:(t)  =  0  with  high  probability.  Similarly,  uniformi/ation  implementations  based 
on  liquation  4.7  would  also  require  importance  sampling  on  the  number  of  state  0  pseudo  transi¬ 
tions  in  order  to  be  effective.  To  combine  forcing  with  liquation  4.7,  we  w  rite 

iV(0- 1  N(t)~  1 

r( 0  =  F;[  X  ^*)|<b>,]P{/o>0  +  r-C  V  g(A^)|tn<t]P{t0<t} 

k=0  k-t) 


=  I'[  X 


since  if  ln  >  t  ,  then  N(l)  =  0  and  therefore 
with  failure  biasing. 


X  x(*k)  =  o  • 


*=n 


bquation  4.8  can  also  be  combined 


We  next  extend  liquation  4.7  in  a  way  that  allows  us  to  condition  out  the  state  0  holding  times. 
While  the  development  below  is  in  terms  of  the  original  embedded  DTMC,  the  results  extend  di¬ 
rectly  to  using  importance  sampling  as  described  in  Proposition  3.1.  We  also  present  the  method 

in  terms  of  conditioning  out  only  the  holding  times  in  state  0,  although  the  method  also  applies 

k 

more  generally.  Analogous  to  the  approach  in  Section  4.1,  define  n^k.)  =  X'(,v=o}  an^ 

k  7=0  ' 

(ik  =  V;yl{  v  .  With  these  definitions,  k )  is  the  number  of  visits  to  state  0  and  (ik  is  the  total 

/= o 

holding  time  in  the  non-7.ero  states  at  the  k’ !h  transition.  Then,  by  l  quation  4  7 
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(4.9) 


<*"*  OO 

^(0  =  *  '[^£(-'*4  *{/r<V(/)-l}  ]  =  '  ( ,%'(/)>/<  + 1 )  ] 

k=n  k=o 

OO 

=  y,l'[ l[wi)>nn  /?£,]]  =  l  [  y^^(Xk)  l.nnik)(t ~  Pk'  4) 3  ■ 

k-0  k=t) 

I  he  key  step  in  the  above  derivation  follows  since  f  i\(l)>k+  1)  =  {t()  +  —  +  lk  <  /}.  The  ex¬ 
changes  of  expectation  and  summation  are  easily  justified  for  finite  state  spaces  by  using  the  dom¬ 
inated  convergence  theorem  (see  [3]). 

To  apply  Fquation  4.9  requires  determining  a  stopping  criterion.  We  could  simulate  until  fik  >  t 

at  which  point  l:n^(t  —  fik.  /0)  —  0.  However,  since  repairs  arc  fast,  flk  grows  slowly  and  therefore 

an  excessive  number  of  transitions  may  have  to  be  simulated.  The  summation  could  be  truncated 

at  some  finite  value.  However,  this  introduces  bias  error.  While  the  error  is  easily  bounded,  we 

prefer  unbiased  estimates,  particularly  for  quantities  such  as  the  interval  unavailability  which  itself 

is  quite  small.  A  simple  unbiased  estimate  can  be  constructed  in  a  reasonable  amount  of  time  as 

follows:  after  the  /.’th  visit  to  state  0,  begin  sampling  the  state  0  holding  times  and  adding  them  to 

flk.  Very  quickly,  fik  will  exceed  t  and  the  sample  is  then  complete.  More  formally,  let  A0( /.)  be 

the  (discrete)  time  at  which  state  0  is  entered  for  the  /.’th  time.  For  k  >  Nn{l.)  +  1 ,  let 
-  k 

Pk  =  P km )  +  X  ■  ,  hcn-  arguing  as  above, 

7=  ’Vn<  /)+ 1 

A’n  (/•>  oo 

r(t)  =  F.[  %(.\k)  /,,„(£)(<  -  Pk,  ^n)  ]  4-  l  [  ^  g(\k)  Ef(t  -  fik,  / -n)  ]  .  (4.10) 

*■=0  *r=/V„(/.)+l 

I  he  estimators  for  the  distribution  of  interval  availability  can  be  formulated  in  a  similar  way.  We 
derive  these  estimators  in  Appendix  A  for  the  sake  of  completeness. 
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5.  Implementation  Issues 


In  this  section  we  consider  the  implementation  of  the  different  variance  reduction  techniques  de¬ 
scribed  in  the  previous  sections.  These  techniques  have  been  implemented  in  the  SAVT  package 
[15.  17]  so  that  large  models  can  be  simulated.  One  salient  feature  of  our  implementation  is  that 
we  use  one  simulation  run  for  estimating  all  the  measures.  Regenerative  simulation  is  used  with  the 
all  components  operational  state  as  the  regeneration  state  The  event  generator  simulates  only  the 
embedded  Markov  Chain  (DI'MC  formulated  in  Section  .VI).  l  or  the  steady-state  measures  we 
accumulate  functions  of  the  mean  holding  times  in  the  various  states,  and  for  the  transient  measures 
we  accumulate  functions  of  the  sample  holding  times  (from  exponential  distributions)  in  the  various 
states.  In  the  following  paragraphs  we  describe  the  implementation  of  the  importance  sampling 
technique  for  the  various  measures. 

Recall  that  we  formulated  the  likelihood  ratios  for  the  transient  measure  in  Proposition  4  1  as  the 
product  of  two  likelihood  ratios  /  ' , ( \ ,)  and  /.'2(X..  tj.  The  first  likelihood  ratio  corresponds  to 
the  embedded  Markov  chain  and  it  is  needed  for  the  steady-state  as  well  as  the  transient  measures 
as  indicated  in  Propositions  .VI  and  4.1.  On  the  other  hand.  /.'2(X T.  t.)  corresponds  to  the  holding 
times,  given  a  sample  path  on  the  embedded  D  I  MC;  this  likelihood  ratio  is  needed  only  for  tran¬ 
sient  measures  and  is  different  for  different  transient  estimators. 

Ihe  importance  sampling  for  the  embedded  Markov  chain  is  based  on  the  following  heuristics. 
As  suggested  in  Section  3.3,  we  need  to  move  the  system  very  quickly  to  the  set  of  failed  states  /. 
and  once  /  is  entered,  the  importance  sampling  should  be  turned  off  so  that  the  system  quickly 
returns  to  state  0,  the  all  components  operational"  state  We  achieve  this  by  increasing  the  prob¬ 
ability  of  failure  transitions  over  repair  transitions.  Phis  has  been  called  “failure  biasing"  in  [2S], 
We  assign  a  combined  probability  bias  1  to  the  failure  transitions  in  all  the  states  where  both  failure 
and  repair  transitions  are  feasible.  Individual  failure  and  repair  transitions  arc  selected  in  the  ratio 
of  their  rates  given  that  a  failure  or  a  repair  is  selected,  respectively.  We  call  this  the  Bias  /  Ratio 
method,  or  simply  Bias!  method.  We  have  found  two  other  methods  useful  for  selecting  indiv  idual 
failure  transitions,  given  that  a  failure  has  occurred.  The  first  is  to  use  a  uniform  distribution  on 
the  failure  transitions  which  has  very  good  performance  for  “unbalanced  systems"  as  shown  in 
Section  fr.  We  call  this  the  Bias/  Balancing  method.  The  second  is  to  give  a  higher  combined 
probability  hias2  to  those  failure  transitions  which  correspond  to  component  types  which  have  at 
least  one  component  of  their  type  already  failed.  This  exhausts  the  redundancy  quickly  and  has 
much  better  performance  for  “balanced  systems"  as  shown  in  Section  6  We  call  this  the 
Bias/  Bias2  method. 
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1  or  the  steads -state  availability  each  regenenttive  cycle  corresponds  to  a  sample.  We  use  either  the 
I1IS  in  the  MSDIS  method  given  in  Section  V2  to  estimate  the  steady-state  availability  l  or  the 
mean  time  to  failure,  a  sample  ends  when  either  the  regeneration  occurs  or  the  system  enters  one 
.'I  the  system  failed  states  from  the  set  / .  In  the  latter  ease,  we  continue  to  simulate  the  embedded 
Markov  chain  until  the  regeneration  occurs  before  starting  a  new  sample.  This  wastes  only  a  few 
events  as  typically  a  regenerative  cycle  has  a  very  few  events  (approximately  twice  the  average  re¬ 
dundancy  which  is  typically  2  or  2)  Once  again,  we  use  either  the  DIS  or  the  MSDIS  method  to 
estimate  the  mean  time  to  failure  l  or  the  transient  measures,  multiple  regenerative  cycles  may  be 
contained  in  a  sample  Moreover,  a  sample  typically  ends  either  when  a  failure  occurs  or  when  the 
tin  e  interval  expires,  which  is  usually  in  the  middle  of  some  regenerative  cycle  As  in  the  mean  time 
to  failure  rase,  we  continue  to  simulate  the  embedded  Markov  chain  until  the  next  regeneration 
occur*  before  starting  a  new  sample  Separate  accumulators  for  the  appropriate  likelihood  ratios 
arc  maintained  tor  each  transient  estimator  and  for  each  time  horizon  of  interest.  Thus,  all  meas¬ 
ures  can  be  estimated  simultaneously  from  a  single  simulation  run 
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6.  Examples  and  Discussions 


in  this  section,  wc  provide  an  example,  based  on  a  model  of  a  computing  system,  to  illustrate  the 
effectiveness  of  the  different  variance  reduction  techniques  discussed  in  the  previous  sections.  A 
block  diagram  of  the  computing  system  considered  is  shown  in  Figure  1.  We  use  two  different  pa¬ 
rameter  sets  to  create  a  "balanced"  and  an  “unbalanced"  system.  A  balanced  system  is  one  in 
which  each  type  of  component  has  the  same  amount  of  redundancy,  (i.c.,  same  number  of  com¬ 
ponents  of  a  type  must  faii  in  order  that  the  system  fail,  e  g.,  l-out-of-2  of  a  type  has  the  same  re¬ 
dundancy  as  3-out-of-4  of  another  type);  in  addition,  the  components  must  have  the  same  order 
of  magnitude  failure  rates.  A  system  that  is  not  balanced  is  unbalanced. 


Figure  1.  A  block  diagram  of  the  computing  system  modeled 


For  a  balanced  system  wc  select  two  sets  of  processors  with  2  processors  per  set,  two  sets  of  con¬ 
trollers  with  2  controllers  per  set,  and  6  clusters  of  disks,  each  consisting  of  4  disk  units,  fn  a  disk 
cluster,  data  is  replicated  so  that  one  disk  can  fail  without  affecting  the  system.  Flic  "primary”  data 
on  a  disk  is  replicated  such  that  one  third  is  on  each  of  the  other  three  disks  in  the  same  cluster. 
Thus  one  disk  in  each  cluster  can  be  inaccessible  without  losing  access  to  the  data.  The 
connectivity  of  the  system  is  shown  in  Figure  I.  Wc  assume  that  when  a  processor  of  a  given  type 


6  F.xamples  and  Discussions 


25 


tails,  it  lias  a  H  OI  probabilitx  of  causing  the  operating  processor  of  the  other  type  to  fail,  lach 
unit  in  the  system  has  two  failure  modes  which  occur  with  equal  probability.  The  failure  rates  of 
processors,  controllers  and  disks  are  assumed  to  be  1  2000,  I  200(1  and  I  6000  per  hour,  respec¬ 
tively.  I  he  repair  rates  for  all  mode  1  and  all  mo  'e  2  failures  are  I  per  hour  anil  I  2  per  hour,  re¬ 
spectively.  Components  are  repaired  by  a  single  repairman  who  chooses  components  at  random 
from  the  set  of  tailed  units.  The  system  is  defined  to  be  operational  if  all  data  is  accessible  to  both 
processor  types,  w  hich  means  that  at  least  one  processor  of  each  type,  one  controller  in  each  set, 
and  2  out  of  4  disk  units  in  each  of  the  6  disk  clusters  are  operational.  We  also  assi  me  that  oper¬ 
ational  components  continue  to  fail  at  the  given  rates  when  the  system  is  failed. 

We  make  minor  changes  to  the  above  parameters  setting  in  order  to  create  an  unbalanced  system. 
We  increase  the  number  of  processors  of  each  type  to  4.  and  double  each  processor's  failure  rate 
to  1  1000  per  hour.  We  decrease  the  failure  rates  of  all  other  components  by  a  factor  of  ten.  In  this 
system,  although  a  processor  failure  is  more  likely  to  occur  in  a  failure  transition,  it  is  less  likely  to 
cause  a  system  failure  due  to  the  high  processor  redundancy.  This  is  typical  behavior  for  an  un¬ 
balanced  system. 

6.1.  Steady-State  measures 

In  this  section  w  e  discuss  the  results  of  our  experiments  for  estimating  the  steady-state  unavailability 
and  the  mean  time  to  failure.  Numerical  (non-simulation)  results  for  these  measures  were  obtained 
using  the  SANT  package  [17],  Since  the  balanced  system  has  a  few  hundred  thousand  states  and 
the  unbalanced  system  has  close  to  a  million  states,  only  bounds  could  be  computed  [44]  These 
bounds  are  very  tight  and  typically  do  not  differ  from  the  exact  results  significantly.  We  simulate 
both  the  balanced  and  the  unbalanced  systems.  The  goai  f  i...  .'•.•nutation  experiments  is  to  study 
the  efficiency  of  the  importance  sampling  methods,  described  in  this  paper,  compared  to  standard 
simulation.  We  also  experimented  with  the  MSDIS  technique  described  in  Section  4  It  is  shown 
that  the  Bias!  method  gives  many  orders  of  magnitude  variance  reduction  over  the  standard  Monte 
Carlo  simulation.  Moreover,  further  significant  improvements  can  be  obtained  using  the 
Bias/  Bias!  method  for  the  balanced  systems  and  Bias  I  ■  Balancing  method  for  the  unbalanced  sys¬ 
tems  f  urther  improvements  arc  obtained  when  these  methods  arc  combined  with  MSOIS 

Table  1  and  fable  2  show  the  results  obtained  for  the  balanced  and  the  unbalanced  systems,  re¬ 
spectively.  We  ran  the  simulation  long  enough  sir  that  the  smallest  entry  in  the  tables  for  the  per¬ 
centage  relative  half-widths  of  the  90%  confidence  intervals  was  less  than  5%.  The  percentage 
relative  half-width  of  a  confidence  interval  is  defined  to  be  100%  times  the  confidence  interval 
half-width  divided  by  the  point  estimate.  This  corresponds  to  approximately  100.000  events  for 
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each  entry  in  1  able  1  anti  1,000,000  events  for  each  entry  in  Table2,  respectively,  l  or  the  MSDIS 
entries,  we  assigned  10°  n  of  the  total  events  to  estimate  the  denominator  (numerator)  for  unavail¬ 
ability  (\1 1  1  I  )  as  suggested  in  Section  T4.  Based  on  impirical  results  obtained  in  [4],  [1R]  and 
[41],  the  values  for  bias  1  and  bias2  were  selected  as  follows:  for  IMS,  .5  and  ,5,  and  for  MSDIS,  .9 
and  .9. 

There  are  a  few  important  points  to  note  in  the  tables,  l  or  the  balanced  system,  the  Bias II Bias! 
method  is  most  effective,  which  supports  our  intuition  that  it  helps  push  the  system  quickly  towards 
a  likely  path  to  failure,  l  or  the  unbalanced  system,  the  Bisa I / Balancing  is  the  most  effective 
method,  which  also  supports  our  intuition  as  follows.  By  making  individual  failures  equally  likely 
we  are  also  increasing  the  failure  probability  of  a  more  reliable  but  less  redundant  component,  thus 
leading  to  a  more  likely  path  to  failure. 

Also  note  that  the  percentage  relative  half  widths  for  both  the  steady -state  unavailability  ( l  )  and 
the  mean  time  to  failure  (M  i  l  l  )  are  approximately  equal.  This  is  because  the  estimate  of  ('  is 
approximately  proportional  to  the  estimate  of  1/M  I  I  I'.  To  sec  this,  using  the  notation  of  Section 
2,  minfnf/.,  oi())  =  tn  with  high  probability  when  no  importance  sampling  is  used.  Thus  an  indi¬ 
vidual  sample  r.v.  in  the  numerator  of  the  ratio  for  M  l' I  T  (liquation  2.10)  is  equal  to  the  r.v.  in 
the  denominator  of  l'  (liquation  2.8)  with  high  probability  Now  for  the  three  state  model,  a 
sample  r.v.  Cl  in  the  numerator  of  ( !  is  G  —  fi2  x  nr  where  nr.  is  the  number  of  times  the  failure  slate 
is  entered.  (sing  our  importance  sampling  schemes,  (7  =  h2  !{„,  ~ij  with  high  probability.  Fur- 
thcrmorc,  l{„r  =  i)  =  !{*,.< „„)  with  high  probability  so  an  individual  sample  r.v.  in  the  denominator 
of  the  MTTF  ratio  is  proportional  to  the  r.v.  in  the  numerator  of  V  with  high  probability.  Thus, 
an  estimate  of  (  is  approximately  proportional  to  1/M  TIT.  Finally,  direct  manipulation  of  the 
asymptotic  variance  (liquation  3.6)  shows  that  the  relative  half  width  of  a  ratio  is  equal  to  the  rel¬ 
ative  half  width  of  its  reciprocal. 

We  next  performed  the  so  called  coverage  experiments  (sec  c.g.,  [25])  to  determine  the  validity  of 
the  confidence  intervals  that  are  formed  based  on  the  asymptotic  central  limit  theorems  described 
in  Section  3.  Such  studies  are  important  since  certain  variance  reduction  techniques  sometimes  do 
not  produce  valid  confidence  intervals,  except  for  very  long  run-lengths  (see  c.g.,  [25])  In  such 
cases,  the  variance  reduction  technique  cannot  be  relied  upon  to  actually  shorten  simulation  run 
lengths. 

We  performed  experiments  on  estimates  of  the  steady-state  unavailability  ,  (  ,  in  the  above  described 
balanced  system  as  follows  We  chose  three  run  lengths  corresponding  to  small,  medium  and  large 
sample  si/es  and  we  considered  three  ways  of  estimating  l’:  standard  simulation,  the  Bias/ ! Bias2 
method  with  D1S  and  the  Bias! i Bias!  method  with  MSDIS.  For  each  method  ami  run  length  we 
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ran  R~  100  replications  and  formed  point  estimates  () . l'R  and  90%  confidence  internals. 

R  r_ 

We  then  calculated  the  mean  percent  relative  bias  (=  1 00"  x  (  UR)  X  ((/  —  t  )/(  )  and  the 

/=  i 

standard  deviation  of  this  mean.  Note  that  if  an  estimate  is  unbiased,  then  its  mean  percent  relative 
bias  should  converge  to  zero  as  R  -•  co.  We  also  calculated  the  90%  coverage  which  is  the  per¬ 
centage  cif  die  (alleged)  90%  confidence  that  actually  contain  the  true  value  (  If  the  confidence 
interval  is  valid,  then  by  definition,  the  90%  coverage  should  be  close  to  90%, 

We  also  computed  the  mean  percent  relative  half  width  of  the  90%  confidence  intervals.  The  re¬ 
sults  are  listed  in  l  ablc  V  Note  that,  as  anticipated  from  Section  T  5.  the  standard  estimate  is  sig- 
nificantls  more  biased  than  cither  the  D1S -Riasl  Ria<2  or  the  MSf)IS-/b<u/  Riax2  estimates  and 
that  its  confidence  intervals  are  at  least  an  order  of  magnitude  wider,  f  urthermore,  for  the  small 
run  length,  its  coverage  drops  significantly  below  90%.  In  fact,  there  were  no  system  failures  in  the 
runs  corresponding  to  the  40" n  of  the  confidence  intervals  which  did  not  contain  (  l  sing  our 
variance  reduction  techniques,  all  the  coverages  are  close  to  the  nominal  90%  value  except  for  the 
longest  run  using  MSOIS  which  had  a  coverage  of  only  81%. 

I  his  dip  in  coverage  concerned  us  since,  typically,  coverages  improve  with  run  length.  There  arc 
a  number  of  possible  explanations  for  this  phenomenon:  a  random  fluctuation,  numerical  errors  in 
(  or  j  l ),  rij),  problems  with  the  random  number  generator,  non-monotone  convergence  to 
normality,  problems  with  the  importance  sampling  scheme,  or  possibly  a  subtle  bug  in  the  com¬ 
puter  codes  (although  the  code  has  produced  valid  estimates  for  all  examples  tested).  We  recom¬ 
piled  the  numerical  solver  and  simulator  to  quadruple  precision  (from  double)  and  obtained 
essentially  identical  results.  We  then  changed  random  number  genet ators  from  the  multiplicative 
linear  congruential  generator  /,)+|  =  ( lnx 1 6807)  mod  2?l  -  1  to  the  combined  generator  described 
in  [  2b  ]  With  R  =  200  replications,  the  coverage  increased  to  85%  which  still  represents  a  statis¬ 
tically  significant  departure  from  90%  (although  its  confidence  interval  overlays  with  that  of  the 
81%  coverage).  Thus,  at  present,  and  despite  considerable  effort,  we  have  been  unable  to  identify 
the  source  of  this  slight  coverage  problem.  Since  our  importance  sampling  methods  are  designed 
to  move  towards  the  most  likely  failure  states,  it  may  be  that  we  are  missing  secondary  failure 
modes  which  would  become  significant  at  such  high  levels  of  precision.  Note  that  the  first  non-zero 
digit  in  (  is  in  the  sixth  decimal  place,  so  the  problem  is  occurring  in  the  eight  decimal  place, 
f  urthermore,  in  practice,  such  high  precision  is  probably  not  warranted  given  inaccuracies  in  model 
parameters  and  distributional  assumptions. 
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6.2.  Transient  Measures 


In  this  section  we  discuss  the  results  of  our  experiments  for  estimating  reliability  and  expected  in¬ 
terval  availability.  Recall  that  for  transient  measures  we  not  only  want  the  system  to  move  quickly 
towards  the  set  of  system  failed  states  /',  but  also  reach  there  before  the  observation  period  expires. 
Since  these  two  issues  are,  in  some  sense,  orthogonal,  we  use  the  same  technique  as  in  the  steady- 
state  case  to  bias  the  embedded  Markov  chain  towards  the  system  failed  set,  in  addition  to  another 
independent  technique  (e.g.,  forcing  or  conditioning  as  discussed  in  Section  5)  to  reduce  the  vari¬ 
ance  due  to  holding  times  in  the  various  states.  The  likelihood  ratios  corresponding  to  these  two 
aspects  of  simulation  arc  independent  and  can  be  formulated  as  in  Proposition  4.1.  The  goal  of  the 
simulation  is  to  study  the  effect  of  the  forcing  and  conditioning  techniques  We  considered  only 
the  balanced  system.  I  or  each  measure,  we  allowed  each  method  to  run  for  400. 000  events. 
Standard  simulation  was  not  considered  as  it  is  very  ineffective  for  estimating  transient  measures. 
The  results  are  given  in  Tables  4  and  5. 

I  or  all  methods,  we  notice  that  the  confidence  intervals  are  smaller  for  some  range  of  intermediate 
time  periods  and  wider  at  the  ends.  To  explain  this,  we  recognize  two  key  factors  affecting  the 
variance  of  the  estimates;  namely,  the  number  of  replications  in  a  simulation  run  and  the  value  of 
bias/  used  with  importance  sampling.  !:or  smaller  time  intervals,  there  are  more  replications  in  a 
simulation  run  than  for  larger  time  intervals  (since  we  kept  the  total  number  of  events  fixed).  I  bis 
contributes  to  a  larger  variance  for  larger  time  intervals,  furthermore,  for  each  time  interval,  there 
is  an  optimal  value  for  bias /  which  maximizes  the  variance  reduction.  While  bias/  =  0.5  may  be 
close  to  optimal  for  some  intermediate  range  of  time  intervals,  it  departs  from  the  optimal  value  for 
either  smaller  or  larger  time  intervals. 

Hie  two  tables  indicate  that  forcing  and  conditioning  arc  most  effective  for  short  time  intervals. 
This  is  intuitive  because  for  a  long  interval  enough  transitions  occur  before  the  interval  expires,  and 
therefore,  the  embedded  Markov  chain  has  a  chance  to  reach  the  system  failed  set  using  only  failure 
biasing  This  is  not  true  for  short  intervals,  and  therefore,  either  forcing  transitions  to  occur  before 
the  end  of  the  period  or  conditioning  the  holding  time  out  in  state  0  lias  a  significant  effect.  Roth 
forcing  and  conditioning  give  similar  results  for  unreliability,  while  conditioning  is  consistently 
better  for  the  interval  unavailability.  Note  that  for  interval  unavailability  we  arc  using  liquation  4.7 
with  conditioning,  but  not  with  forcing.  However,  forcing  can  be  similarly  combined  with 
liquation  4.7  to  possibly  yield  better  results  Also,  note  that  Bias  /  j  Bias2  method  is  consistently 
better  than  both  the  Bias/  and  the  Bias /I Balancing  methods.  This  is  consistent  with  a  similar  con¬ 
clusion  with  respect  to  the  steady-state  measures  in  a  balanced  system.  This  is  intuitively  reasonable 
because  these  methods  correspond  to  importance  sampling  in  the  embedded  Markov  chain;  this 
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sampling  is  independent  of  forcing  and  conditioning,  lor  unbalanced  systems,  we  expect  that 
Bunt  Balancing  will  be  consistently  better  than  the  other  two  methods;  this  is  supported  by  a  pre¬ 
liminary  empirical  investigation  in  progress. 
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7.  Summary  and  Directions  for  Future  Work 


In  this  paper,  we  have  developed  a  unified  framework  for  simulation  of  Markovian  models  of  highly 
dependable  systems.  Conventional  numerical  techniques  are  difficult  to  apply  to  this  class  of 
stochastic  models  because  of  the  fact  that  the  size  of  the  state  space  of  the  Markovian  motlel  in¬ 
creases  exponentially  with  the  number  of  components  in  the  system.  On  the  other  hand,  simulation 
algorithms  tend  to  be  relatively  insensitive  to  the  si/e  of  the  state  space  of  the  simulated  Markovian 
model,  both  in  terms  of  storage  and  computational  requirements.  However,  standard  simulation 
is  inefficient  in  our  setting  because  the  principle  focus  of  interest;  namely,  system  failures,  occur  so 
infrequently  in  highly  dependable  systems.  As  a  consequence,  few  system  failures,  if  any,  would 
be  observed  if  standard  simulation  methods  were  to  be  user!  in  our  problem  context. 

The  emphasis  in  this  paper  has  therefore  centered  on  applying  variance  reduction  techniques  to 
improve  the  efficiency  of  the  simulations  associated  with  Markovian  models  of  highly  dependable 
systems  We  have  reviewed  the  basic  theory  of  importance  sampling  in  several  elementary  problem 
settings  and  then  used  this  insight  to  develop  sampling  heuristics  for  the  complex  systems  of  interest 
here  Different  variants  of  these  ideas  were  developed  for  both  transient  and  steady-state 
dependability  measures.  In  addition,  we  have  "fine-tuned"  the  importance  sampling  techniques  to 
take  advantage  of  the  structure  of  highly  dependable  systems  which  are  cither  balanced  or  unbal¬ 
anced 

Our  work  has  also  shown  that  importance  sampling  may  be  fruitfully  applied  in  conjunction  with 
a  variance  reduction  method  known  as  conditioning.  The  basic  idea  here  was  to  observe  that  highly 
dependable  systems  spend  a  significant  fraction  of  time  in  the  state  in  which  all  components  are 
fully  operational.  Since  the  stochastic  behavior  of  the  time  spent  in  the  fully  operational  state  was 
easy  to  calculate  analytically,  this  permitted  us  to  effectively  integrate  out  the  randomness  in  our 
importance  sampling  estimators  due  to  the  holding  times  in  the  fully  operational  state. 

Our  empirical  investigation  showed  that  the  combined  variance  reduction  obtained  by  using  both 
conditioning  and  importance  sampling  is  typically  substantial.  In  fact,  in  all  of  our  experiments,  our 
methods  yielded  estimators  in  which  the  variance  was  decreased  by  several  orders  of  magnitude. 
Our  empirical  work  also  showed  that  the  confidence  intervals  associated  with  our  estimators  typi¬ 
cally  provided  acceptable  levels  of  coverage.  We  view  this  as  important,  since  the  scientific  repre¬ 
sentation  of  the  accuracy  of  a  simulation  estimator  is  usually  gauged  through  a  confidence  interval. 

A  number  of  possible  directions  for  future  research  present  themselves.  One  important  issue  relates 
to  the  fact  that  the  importance  sampling  heuristics  presented  in  this  paper  were  basically  developed 


7  Summary  and  Directions  for  Future  Work 


.71 


for  systems  in  which  system  dependability  is  achieved  principally  through  high  component  reli¬ 
ability.  However,  another  approach  to  obtaining  high  system  dependability  is  through  high  levels 
of  component  redundancy.  Importance  sampling  methods  appropriate  for  the  analysis  of  highly 
redundant  systems  differ  from  those  presented  here.  Such  techniques  would  likely  have  important 
ramifications  for  the  simulation  of  certain  highly  dependable  systems. 

A  second  important  research  area  involves  the  generalization  of  the  ideas  developed  in  this  paper 
to  stochastic  models  of  highly  dependable  systems  in  which  the  underlying  failure  and  repair  dis¬ 
tributions  are  non-exponential.  Since  the  resulting  stochastic  process  is  typically  no  longer  cither 
Markovian  or  regenerative,  many  of  the  ideas  presented  in  this  paper  cannot  be  implemented  di¬ 
rectly.  Nevertheless,  it  is  likely  that  the  failure  biasing  ideas  that  worked  successfully  in  the 
Markovian  setting  can  be  suitably  generalized  to  obtain  effective  sampling  heuristics  for  non- 
Markov  ian  models.  However,  it  appears  less  likely  that  the  conditioning  techniques  of  this  paper 
can  be  applied  to  non-Markovian  models;  our  conditioning  technique  depends  critically  on  the 
exponential  nature  of  the  holding  times  in  Markovian  models. 


7.  Summary  and  Directions  for  Future  Work 


72 


References 


[1]  Bobbin,  A.  and  Trivcdi.  K.  S.  (I486).  An  Aggregation  Technique  for  the  Transient  Analysis 
of  Stiff  Markov  Chains.  I  I'll  Transactions  on  Computers  (-35,  803-814. 

[2]  Bourieius,  W.G.,  Carter,  W.C.  and  Schneider,  PR.  (I960).  Reliability  Modeling  Techniques 
for  Self- Repairing  Computer  Systems.  Proceedings  of  ACM  National  Conference.  San 
T'raneisco,  California,  295-309. 

[3]  ('hung,  K..1  .  (1967).  Markov  Chains  With  Stationary  Transition  Probabilities,  Second  edi¬ 
tion.  Springer- Verlag,  New  York. 

[4]  C  onway,  A  T.  and  Goyal.  A.  (1987).  Monte  Carlo  Simulation  of  Computer  System 
Availability  Reliability  Models.  Proceedings  of  the  Seventeenth  Symposium  on  Fault-Tolerant 
Computing.  Pittsburgh,  Pennsylvania,  230-235. 

[5]  Costes.  A.,  Doucct,  J.I  .,  I  andrault.  C.  and  I,apric,  .1.(7.  (1981).  SI  R I  -  A  Program  for 
Dependability  evaluation  of  Complex  l  auit-  Tolerant  Computing  Systems.  Proceedings  of 
the  Flcvcnth  Symposium  on  Fault-Tolerant  Computing.  Portland,  Maine,  72-78. 

[6]  Cottrell,  M.,  1'ort,  J.C.  and  Malgouyres,  (i.  (1983).  I  arge  Deviations  and  Rare  Invents  in  the 
Studs  of  Stochastic  Algorithms.  IFFE  Transactions  on  Automatic  Control  AC-28,  907-920. 

[7]  Cramer,  II.  (1946).  Mathematical  Methods  of  Statistics.  Princeton  University  Press, 
Princeton,  New  Jersey. 

[8]  Crane,  M.A.  and  Igiehart,  D.l,.  (1975).  Simulating  Stable  Stochastic  Systems,  III:  Regener¬ 
ative  Processes  and  Discrete  I'vcnt  Simulations.  Operations  Research  23,  33-45. 

[9]  Dugan.  J.B.,  Trivcdi,  K.S.,  Smotherman,  M.K.  and  Cicist,  R.M.  (1986).  The  Hybrid  Auto¬ 
mated  Reliability  Predictor.  Journal  of  Guidance,  Control,  and  Dynamics.  9,  3,  319-331. 

[10]  l  ox,  H  I  .,  and  Glynn,  P.W.  (1986).  Discrete-  Time  C  onversion  for  Simulating  Semi-Markov 
Processes.  Operations  Research  Letters  5,  191-196. 

[11]  f  ox,  H  I  and  Glynn,  PAV.  (1988).  Discrete-  Time  Conversion  for  l'inite-IIorizon  Markov 
Processes.  Technical  Report  No.  790,  School  of  Operations  Research  and  Industrial  engi¬ 
neering,  Cornell  University,  Itacha,  New  York. 

[12]  (feist,  R.M.  and  Trivcdi,  K.S.  (1983).  Ultra-High  Reliability  Prediction  for  Tault-Tolcrant 
Computer  Systems.  I  FI:  F  Transactions  on  Computers  032,  1118-1127. 

[13]  Glynn,  P.W.  and  Hcidclbcrgcr,  P.  (1988).  Bias  Properties  of  Budget  Constrained  Monte 
Carlo  Simulations.  IBM  Research  Report  RC  13581,  Yorktown  Heights.  New  York.  To 
appear  in  Operations  Research. 

[14]  Glynn.  PAV.  and  Igiehart,  D.l  (1989).  Importance  sampling  for  Stochastic  Simulations. 
To  appear  in  Management  Science. 

[15]  Goyal,  A.,  (farter,  W.C.,  de  Souza  e  Silva,  I-,.,  I .avenberg,  S.S.,  and  Trivcdi,  K.S.  (1986).  The 
S'  stein  Availability  estimator.  Proceedings  of  the  Sixteenth  Symposium  on  Fault-Tolerant 
C  imputing.  Vienna,  Austria,  84-89. 

[16]  Goyal,  A.,  I avenberg,  S.S.,  and  'Trivcdi,  K.S.  (1987).  Probabilistic  Modeling  of  Computer 
System  Availability.  Annals  of  Operations  Research  8,  285-306. 

[17]  Goyal.  A.  and  I. avenberg,  S.S.  (1987).  Modeling  and  Analysis  of  Computer  System  Avail¬ 
ability.  IBM  Journal  of  Research  and  Development,  31  6,  651-664. 


s.t 


References 


[IS]  Goyal.  A.,  1  leidclbergcr.  I*,  and  Shahabuddin.  I*.  (1687).  Measure  Specific  Dynamic  Impor¬ 
tance  Sampling  for  Availability  Simulations.  1987  Winter  Simulation  Conference 
Proceedings.  A.  Thesen.  II.  Cirant  and  W  D.  Kelton  (eds).  Ill  I  Press.  A5 1 -A57. 

[  1 6]  Ciross,  1).  and  Miller,  D.R.  (1684).  I  he  Randomization  I  echniquc  as  a  Modeling  Pool  and 
Solution  Procedure  for  Transient  Markov  Processes.  Operations  Research  32,  343-361. 

[20]  Nammerslev,  J.M.  anti  Ifandscomb,  D.C.  (1664).  Monte  Carlo  Methods.  Methuen,  I  ondon. 

[21]  Hordijk.  A  ,  Iglchart,  D.I  .  and  Schassbcrger,  R.  (1676).  Discrete  Time  Methods  for  Simu¬ 
lating  Continuous  Time  Markov  (drains.  Adv.  Appl.  I'roh  8,  772-788. 

[22]  Kahn.  II.  and  Marshall.  AAV.  (1653).  Methods  of  Reducing  Sample  Size  in  Monte  Carlo 
Computations.  Journal  of  Operations  Research  Society  of  America  I,  5.  263-278 

[23]  Karlin.  S.  and  Taylor.  I  I.M.  ( 1675).  A  first  Course  in  Stochastic  Processes,  Second  edition. 
Academic  Press.  New  York. 

[24]  Knuth.  DT.  (1676).  Big  Omicron,  Big  Omega  and  Big  Theta.  SIC  ACT  Mews  (ACM)  8, 
2.  18-24. 

[25]  lavcnbcrg,  S.S.,  Moeller,  T.I  .  and  Sauer.  CM  I.  (1676).  Concomitant  Control  Variables 
Applied  to  the  Regenerative  Simulation  of  Queueing  Svstcms.  Operations  Research  27.  1, 
134-160. 

[26]  I  ecuver,  P.  (1688).  efficient  and  Portable  Combined  Random  Number  Generators.  Comm, 
of  the  ACM  31.  6.  742-774. 

[27]  I  ehinann,  PI  (1683).  Theory  of  Point  estimation.  John  Wiley  and  Sons,  Inc.,  New  York. 

[28]  l  ewis,  I  T  and  Bdhm.  T  .  (1684).  Monte  Carlo  Simulation  of  Markov  Unreliability  Models. 
Nuclear  engineering  and  Design  77,  46-62. 

[26]  l  iceaga,  C.A.  and  Sicwiorek,  D  P.  (1686).  Towards  Automatic  Markov  Reliability  Modeling 
of  Computer  Architectures.  NASA  Technical  Memorandum  89009. 

[30]  Makam,  S.V.,  Avizicnis,  A.  and  Grusas,  G.  (1682).  I’d. A  ARITS  82  User's  Guide.  Also 
Technical  Report  CSD-8208.30,  University  of  California  at  I  os  Angeles. 

[31]  Johnson,  Jr.  A.  M.  and  Malek,  M.  (1988).  Survey  of  Software  Tools  for  evaluating  Reli¬ 
ability,  Availability,  and  Serviceability.  A  CM  Computing  Surveys  20,  4.  227-266. 

[32]  Meyer,  J.I:  (1680).  On  evaluating  the  Pcrformability  of  Degradable  Computing  Systems. 
Iliee  Transactions  on  Computers  C-39,  8.  720-731. 

[33]  Miller.  R.G.  (1974).  The  Jackknife  -  A  Review.  Riomelrika  61 ,  1-15. 

[34]  Muntz,  R.R.,  dc  Souza  e  Silva,  II.  and  Goyal,  A.  ( 1686).  Bounding  Availability  of  Repairable 
Computer  Systems.  Proceedings  of  the  ACM  Sigmctrics  and  Performance' 89.  Berkeley, 
California. 

[35]  Nicola,  V  P.  (1686).  Pumping  in  Markov  Reward  Processes.  IBM  Research  Report  RC 
14716,  Yorktown  Heights,  New  York. 

L36]  R  oss,  S  M.  (1670).  Applied  Probability  Models  with  Optimization  Applications.  Ilolden-Day, 
San  Prancisco. 

[37]  Ross,  S.M.  and  Scheduler,  Z.  (1670).  Using  Simulation  to  estimate  Pirst  Passage  Distrib¬ 
ution.  Management  Science  31,  2,  224-234. 

[38]  Parckh,  S.  and  Walrand,  J.  (1686).  A  Quick  Simulation  Method  for  Pxccssivc  Backlogs  in 
Networks  of  Queues.  IPPP  Transaction  on  Automatic  Control  34.  1,  54-66. 


.14 


References 


F-W]  Sahncr.  R.A  anti  I  rivedi,  K.S.  (1087).  Performance  anti  Reliability  Analysis  l  sing  Acyclic 
Directed  Graphs.  H  i'!'  Transaction  on  Software  Engineering  SK-13,  10,  1 105- !  I  14. 

[40]  Sanders,  \V.  II.  and  Meyer,  .1.  I  .  (I486)  MITASAN:  A  Performability  (‘.valuation  Tool 
Based  on  Stochastic  Activity  Networks.  Proceedings  of  the  I9S6  Tad  Joint  Computer  Con¬ 
ference.  AI1FS,  New  York,  807-816. 

[41]  Shahabuddin,  I’..  Nicola,  V  I  .,  I Icidclbcrgcr,  I’.,  (loyal  A.  and  Glynn  I’.VV.  (1088).  Variance 
Reduction  in  Mean  lime  to  Failure  Simulations.  IQS  ft  Winter  Simulation  Conference  Pro¬ 
ceedings.  \1.A.  Abrams,  P.I..  Ilaigh  and  J.C.  Comfort  (eds.).  IITT  Press,  401-400. 

[42]  Siegmund,  D.  (1076).  Importance  Sampling  in  the  Monte  Carlo  Study  of  Sequential  Tests. 
The  Annals  of  Statistics  4.  673-684. 

[43]  Smith.  W.l  .  (1055).  Regenerative  Stochastic  Processes.  Proe.  Roy.  Soe.  A.  232,  6-31. 

[44]  Walrand,  .1.  (1087).  Quick  Simulation  of  Rare  1  vents  in  Queueing  Networks.  Proceedings 
of  the  Second  International  Workshop  on  Applied  Mathematics  and  Performance' Reliability 
Models  of  Computer: Communication  Systems.  G.  Jazeolla.  P..J.  Courtois  and  0  .1  Boxma 
(eds).  North  Holland  Publishing  Company,  Amsterdam,  275-286. 


Reference  s 


.15 


Appendix  \:  Estimating  the  Distribution  of  Interval  Availability 


We  will  find  it  more  convenient  to  estimate  the  distribution  of  the  time  in  the  set  of  failure  states, 
l- (?..v)  =  P{  1(1)  <  .v},  where  <  (/)  =  /'  1(}.  €/)e/.r.  Since  A(t)  =  I  -  f  (t)ji  we  have 
t(/.x)  =  P !••!(/)<  .v}  =  1  —  /'(/.(I  —  x)t).  I  o  derive  an  estimator  for  l  '(t,x),  write 


!'(/.. v)  =  r,(t.x)  +  l'2U.x)  ( A  I ) 


where  l  |(/..v)=  P{/  (?)  <  ,v,  )  ,41  1  and  l  2(?,.v)  =  I’{  (■’(/)<  .v.  l,e  /"].  Define  /,  =  6/  ,  and 

i  .  ‘  j=n 

()=  Lf '(Y*n  v 4 1 ) •  w''h  )'rPi  ;mtl  defined  previously.  Note  that  /?,•  =  <’■+/■)•  and 

./-a 

^  i  + 1  =  >'/  4  (  +  /'.  ('(insider  (  \U,x): 


r,(?„r)  =  i’!no<.r,  >^/i  =  Vi>{r(?)<.v.,\^r, ,V(o  =  /| 

/=() 

cy> 

=  5j’I <  .r.  .V^/'.  7}  <  /  <  7}+l }  (A. 2) 

/■=() 

=  ~  'V‘  Vi- 1  +  ^/-|  <  <  <  >'/  +  /bl  • 

/=n 

Now  if  A)  =  0,  then  =  / ? f.  If  A'(^F  and  A,-  0,  then  «,,(/  -  1)  =  «,,(/)  and  therefore  y4_ ,  =  y;.  In 

either  case,  by  conditioning  on  the  sequence  of  states  and  the  holding  times  in  the  non-zero  states, 
we  can  write 

QQ 

f  |(/,.Y)  =  1[  ,  <t,  X.4D  (  t'n„(i-\){'  ~  Pl- I  ■  Ad  ~  l'-nrfi)ll  ~  Pi- ■  (AA) 

/=(> 

Now  consider  F;2(?..v).  If  )',  e  /  and  N(t)  —  i.  then  l’(t)  =  1)  ,  4-  I  —  I)  =  (  -  y(_,  —  I  urt her- 

more.  on  this  set  «,,(/  -  1)  =  y(_|  =  y,-  and  (’_|  -  (’■  so  that 


1  2(,"v)  -  1  (F  1  SX.  X,  ft]  1  (I-  y,  ,  -  C,  ,  <x)  1  {y,  ,  +  <•:,  ,  +  /,  ,  <»-  y,  +  C,  +  /-,)  3 

/=(> 

=  y'.hr.  ,  <*,  -V,  e!  )  (  l'nn{i)(l  ~  1  i- 1  _  ^/-l  •  Ad  ~  n(i)(l  ~  (  i  ~  -*>.  Ad  )] 

/=() 


(A. 4) 


I  his  development  provides  a  computationally  attractive  way  to  estimate  (  '(?,. r)  without  sampling 
from  the  state  0  holding  time  distribution.  It  is  easily  combined  with  failure  biasing.  Practical 
implementations  may  require  truncation  or  stopping  rules  as  described  in  Section  4.2. 
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table  1  l  navailahility  and  M  i  l  l  I  stimatcs  and  Relative  IIW 
in  a  Balanced  System 


Numerical 

Direct 

Bias  I 
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Results 
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able  systems.  Since  the  failure  event  is  a  rare  event,  the  estimation  of  system  dependability  measures 
usuig  standard  simulation  requires  very  long  simulation  runs.  We  show  that  a  variance  reduction 
technique  called  Importance  Sampling  can  be  used  to  speed  up  the  simulation  by  many  orders  of 
magnitude  over  standard  simulation.  This  technique  can  be  combined  very  effectively  with  regen¬ 
erative  simulation  to  estimate  measures  such  as  steady-state  availability  and  mean  time  to  failure. 
Moreover,  it  can  be  combined  with  conditional  Monte  Carlo  methods  to  quickly  estimate  transient 
measures  such  as  reliability,  expected  interval  availability  and  the  distribution  of  interval  availability 
We  show  the  effectiveness  of  these  methods  by  using  them  to  simulate  large  dependability  models. 
We  also  discuss  how  these  methods  can  be  implemented  in  a  software  package  to  compute  both 
transient  and  steady-state  measures  simultaneously  from  the  same  sample  run. 
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